【Question title】: DATEXII XML file to DataFrame in Python
【Posted】: 2018-04-30 01:56:40
【Question】:

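For the past few days I have been trying to open and read a certain XML file (DATEXII format), so far without success. It concerns traffic data from the NDW Open Data website (the Dutch database of road and traffic data), hyperlinked to the XML file source. The head of the tree looks like in this picture and continues like this; see also the snippet below. Together these cover only a tiny fraction of the data.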

<?xml version="1.0"?>
<soapenv:Envelope xmlns:_0="http://datex2.eu/schema/2/2_0" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <exchange xmlns="http://datex2.eu/schema/2/2_0">
        <supplierIdentification>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </supplierIdentification>
      </exchange>
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <publicationCreator>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </publicationCreator>
        <measurementSiteTableReference targetClass="MeasurementSiteTable" version="955" id="NDW01_MT" />
        <headerInformation>
          <confidentiality>noRestriction</confidentiality>
          <informationStatus>real</informationStatus>
        </headerInformation>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00" />
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="3">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="4">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="5">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="6">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="0">
                  <speed>-1</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="7">

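Ideally, I would like to load the information as a DataFrame using Python in a Jupyter Notebook, so that I can run some predictive analyses if the data allows. I have already tried ElementTree and lxml like this, inspired by many other threads: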

# Standard Packages
import pandas as pd
import numpy as np

# Necessary Packages for XML and setting Working Directory
import os
import xml.etree.ElementTree as ET
import lxml

os.chdir("C:/.../Intensiteiten en snelheden/30-10-2017")

xml_file = open('0600_Trafficspeed.xml').read() # Unzipped the file manually

def xml2df(xml_data):
    root = ET.XML(xml_data) # element tree
    all_records = [] # This is our record list which we will convert into a dataframe
    for i, child in enumerate(root): # Begin looping through our root tree
        record = {} # Place holder for our record
        for subchild in child: # iterate through the subchildren
            record[subchild.tag] = subchild.text # Extract the text, create a new dictionary key, value pair
        all_records.append(record) # Append this record to all_records.
    return pd.DataFrame(all_records) # return records as DataFrame

print(xml2df(xml_file))

However, this only returns a single entry in the first row, i.e. column name: d2LogicalModel, row: 0, entry: None.
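A likely reason, sketched here under the assumption that the file looks like the snippet above: the two nested loops in xml2df() only reach the grandchildren of the root, which is d2LogicalModel, an element that has children but no text of its own. The measurements sit several levels deeper, and most of their tags carry the http://datex2.eu/schema/2/2_0 namespace. The inline document `SAMPLE` is a hypothetical, heavily shortened stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for the real NDW file.
SAMPLE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2">
      <payloadPublication xmlns="http://datex2.eu/schema/2/2_0">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""

root = ET.fromstring(SAMPLE)

# What the nested loops in xml2df() actually visit: root -> child -> subchild.
two_levels = [sub.tag for child in root for sub in child]
print(two_levels)

# root.iter() walks the whole tree; namespaced tags appear as "{uri}localname".
all_tags = [elem.tag for elem in root.iter()]
for tag in all_tags:
    print(tag)
```

Anything below d2LogicalModel has to be reached either with root.iter() or with namespace-qualified paths.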

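I have a hard time viewing the tree structure in Microsoft Edge, which takes a lot of CPU (Notepad++ with the XMLTools plugin also works, but it crashes on "larger" files, i.e. > 20 MB). Even so, the structure remains hard for me to understand. There are too many layers, and I don't know how to define xml2df() with the right sub-sub-children, etc.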

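So my question boils down to this: first, how can I identify the variables/columns that hold data, so that I get an overview of the relevant data I want to import? Second, how do I import it into a DataFrame?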
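One way to get such an overview without opening the file in a viewer is to stream through it and count how often each element path occurs. The sketch below uses `xml.etree.ElementTree.iterparse`; the helper names `local` and `summarize_paths` and the simplified inline `SAMPLE` are illustrative assumptions, not part of the NDW data or the DATEXII guides.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def summarize_paths(source):
    """Count how often each element path occurs while streaming the file."""
    counts = Counter()
    stack = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(local(elem.tag))
            counts["/".join(stack)] += 1
        else:
            stack.pop()
            elem.clear()  # drop the finished element's content to save memory
    return counts


# Hypothetical, simplified sample; pass a file name for the real data instead.
SAMPLE = b"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel>
      <payloadPublication xmlns="http://datex2.eu/schema/2/2_0">
        <siteMeasurements>
          <measuredValue index="1"><basicData><vehicleFlowRate>60</vehicleFlowRate></basicData></measuredValue>
          <measuredValue index="2"><basicData><speed>38</speed></basicData></measuredValue>
        </siteMeasurements>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""

paths = summarize_paths(io.BytesIO(SAMPLE))
for path, n in paths.items():
    print(n, path)
```

Paths that occur many times and end in an element with text or attributes are the natural candidates for rows and columns.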

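Note: since the DATEXII format is a standard for European traffic data, I hoped their guides would help (see documents), but they don't make sense to me yet. Maybe they will to one of you :)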

Any help is much appreciated!

【Question comments】:

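  • Post the full code, including imports. Also, give an example of what your function should return. As a suggestion, leave pandas aside for now and focus on extracting the data into JSON or CSV that pandas can handle. That will make both the work and the help easier.
  • @IgnacioVergaraKausel Thanks for your comment. I have edited the question and included the full code. I agree that extraction matters most for now, but the problem is that I first need an overview of the data to know what I want in return. So far I had hoped that looping over the whole document would give me all the data, which I could then filter and select from.
  • One more thing: please download a complete XML snippet (including the root tag) and post the sample in the body of your question. Don't make us (unpaid volunteers) follow your instructions to download it from a non-English website's API.
  • @Parfait Fair point, I updated the question with links to two images of the XML. Hope this helps.
  • Do not post screenshots, because we cannot copy and paste them into our environment. Help us help you. Don't burden us. We only need enough of an XML snippet (not all of it) to understand the schema.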

Tags: python xml dataframe lxml elementtree


【Solution 1】:

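Consider using XSLT, a language designed to transform XML files into other XML, HTML, or even text (CSV/TAB), to convert the nested XML input into a flatter structure. Specifically, consider the following XSLT, which converts the original XML into tabular comma-separated values that can be imported into pandas with read_csv():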

XSLT (save as an .xsl file, a special XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:pub="http://datex2.eu/schema/2/2_0"
                              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/soapenv:Envelope">
    <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
    <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
    <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="soapenv:Body"/>
  </xsl:template>

  <xsl:template match="soapenv:Body">
    <xsl:apply-templates select="d2LogicalModel"/>
  </xsl:template>

  <xsl:template match="d2LogicalModel">
    <xsl:apply-templates select="pub:payloadPublication"/>
  </xsl:template>

  <xsl:template match="pub:payloadPublication">
    <xsl:apply-templates select="pub:siteMeasurements"/>
  </xsl:template>

  <xsl:template match="pub:siteMeasurements">
    <xsl:apply-templates select="pub:measuredValue"/>
  </xsl:template>

  <xsl:template match="pub:measuredValue">
    <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                 @index,',',
                                 pub:measuredValue/pub:basicData/@xsi:type,',',
                                 descendant::pub:vehicleFlowRate,',',
                                 descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                 descendant::pub:speed)"/><xsl:text>&#xa;</xsl:text>    
  </xsl:template>

</xsl:stylesheet>

Python

from io import StringIO
import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL FILES
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT.xsl')

# INITIALIZE AND RUN TRANSFORMATION
transform = et.XSLT(xsl)
# CONVERT RESULT TO STRING 
result = str(transform(doc))

# IMPORT INTO DATAFRAME
df = pd.read_csv(StringIO(result))

Output (parent node values become repeated indicators alongside the differing numeric data)

print(df)

#             publicationTime country nationalIdentifier msmtSiteTableRef_targetClass  msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass  msmtSiteRef_version     msmtSiteRef_id measurementTimeDefault  measuredValue_index basicData_type  vehicleFlowRate  averageVehicleSpeed_numberOfInputValues  averageVehicleSpeed_value
# 0  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    1    TrafficFlow             60.0                                      NaN                        NaN
# 1  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    2    TrafficFlow              0.0                                      NaN                        NaN
# 2  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    3    TrafficFlow              0.0                                      NaN                        NaN
# 3  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    4    TrafficFlow             60.0                                      NaN                        NaN
# 4  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    5   TrafficSpeed              NaN                                      1.0                       38.0
# 5  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    6   TrafficSpeed              NaN                                      0.0                       -1.0
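
As an alternative to XSLT, the same flattening can be sketched with the standard library alone, using namespace-aware paths in ElementTree. This is a minimal sketch that assumes the structure of the posted snippet. The function name `flatten_measurements`, the prefix `d` in `NS`, the chosen column names, and the inline `SAMPLE` are illustrative assumptions, and only a subset of the columns above is extracted.

```python
import xml.etree.ElementTree as ET

# Prefix "d" is an arbitrary local alias for the DATEX II namespace.
NS = {"d": "http://datex2.eu/schema/2/2_0"}
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def flatten_measurements(root):
    """Return one dict per outer measuredValue, repeating the site-level fields."""
    rows = []
    for site in root.iter("{http://datex2.eu/schema/2/2_0}siteMeasurements"):
        site_ref = site.find("d:measurementSiteReference", NS)
        site_id = site_ref.get("id") if site_ref is not None else None
        time_default = site.findtext("d:measurementTimeDefault", namespaces=NS)
        # Direct children only, so the inner measuredValue is not counted twice.
        for mv in site.findall("d:measuredValue", NS):
            basic = mv.find("d:measuredValue/d:basicData", NS)
            rows.append({
                "msmtSiteRef_id": site_id,
                "measurementTimeDefault": time_default,
                "measuredValue_index": int(mv.get("index")),
                "basicData_type": basic.get(XSI_TYPE) if basic is not None else None,
                "vehicleFlowRate": mv.findtext(".//d:vehicleFlowRate", namespaces=NS),
                "speed": mv.findtext(".//d:speed", namespaces=NS),
            })
    return rows


# Hypothetical, shortened sample modelled on the snippet in the question.
SAMPLE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <payloadPublication xmlns="http://datex2.eu/schema/2/2_0">
        <siteMeasurements>
          <measurementSiteReference id="PZH01_MST_0690_00" version="1"/>
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1"><measuredValue><basicData xsi:type="TrafficFlow">
            <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
          </basicData></measuredValue></measuredValue>
          <measuredValue index="5"><measuredValue><basicData xsi:type="TrafficSpeed">
            <averageVehicleSpeed numberOfInputValuesUsed="1"><speed>38</speed></averageVehicleSpeed>
          </basicData></measuredValue></measuredValue>
        </siteMeasurements>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""

rows = flatten_measurements(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The list of dicts can be passed straight to pd.DataFrame(rows). The values arrive as strings or None, so numeric columns would still need a conversion such as pd.to_numeric.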

【Discussion】:
