【Question title】: How to loop through a complicated XML structure in order to transform it to a pandas data frame
【Posted】: 2017-11-23 10:34:35
【Question description】:

I am trying to extract information from an XML file and convert it into a pandas data frame. The XML has the following structure:

<change user="123" timestamp="2017-09-04T13:58:46.190Z">
    <log id="333" action="create">
        <property id="52122">
            <old/>
            <new>
                <item id="562622" toString="Test"/>
                <item id="033362" toString="Test2"/>
            </new>
        </property>
        <property id="33563">
            <new>
                <item id="44322" toString="Test3"/>
            </new>
        </property>
        <property id="21733">
            <old/>
            <new id="12341212" toString="Test4"/>
        </property>
    </log>
</change>

These are the expected column headers of the data frame:

Change_User|Timestamp|Log_id|Action|property_ID|New_Property_ID|Item_ID|To_String

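I tried this with minidom before, but it was awful. Now I am trying xml.etree.ElementTree.

How can I write code that loops through the whole change element, down to the item ids, without producing duplicates?

I need something like this: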

# root is the root element of the parsed XML file
change_user_id, timestamp, log_id, action = [], [], [], []

for test in root.iter('change'):
    change_user_id.append(test.attrib['user'])
    timestamp.append(test.attrib['timestamp'])
    for log in test:
        log_id.append(log.attrib['id'])
        action.append(log.attrib['action'])
        # now comes the part where I get duplicates and the wrong order of the following values...

        # after some logic...

d = {'changer_user': change_user_id, 'timestamp': timestamp, 'log_id': log_id, 'action': action}  # and so on...

a = pd.DataFrame.from_dict(d, orient='index')

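【Comments】:

Why do the id and toString attributes appear on new in the third property, but not in the first two (where they are attributes of item instead)?
I edited the closing tag to change, and yes, that is one of the problems I am having. This is the original file with other values, so it is not a mistake.
Why is <old/> (empty) missing from the second property but present in the other two?
Because the system has no "old" information to put there. This XML file represents the creation of a new object in the system.
I think I need a loop 4 or 5 levels deep to capture all the values, and to check whether the old tag is empty.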
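
The check for an empty or missing old tag mentioned in the last comment could look roughly like the sketch below. It is only an illustration: `old_state` and the trimmed `SNIPPET` are hypothetical names introduced here, and the three-way split into "missing", "empty", and "populated" is an assumption about what the asker wants to distinguish.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of two properties from the question, inlined so the sketch runs alone.
SNIPPET = """<log id="333" action="create">
    <property id="52122"><old/><new><item id="562622" toString="Test"/></new></property>
    <property id="33563"><new><item id="44322" toString="Test3"/></new></property>
</log>"""


def old_state(prop):
    """Classify the <old> child of a <property> (hypothetical helper)."""
    old = prop.find("old")
    if old is None:
        # find() returns None when the tag is absent
        return "missing"
    # An element is treated as empty here if it has no children,
    # no attributes, and no non-whitespace text.
    if len(old) == 0 and not old.attrib and not (old.text or "").strip():
        return "empty"
    return "populated"


log = ET.fromstring(SNIPPET)
states = {prop.get("id"): old_state(prop) for prop in log.findall("property")}
print(states)
```

Comparing against `None` explicitly matters here, because an element without children is falsy in a boolean test, so `if prop.find("old"):` would not tell "missing" apart from "empty".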

Tags: python xml elementtree

【Solution 1】:

I'm not sure what you are trying to do, but this should help you get started:

import xmltodict

with open('change_user.xml') as fd:
    doc = xmltodict.parse(fd.read())  

doc['change']['log'] #use tags to maneuver through dicts

Prints:

OrderedDict([('@id', '333'),
             ('@action', 'create'),
             ('property',
              [OrderedDict([('@id', '52122'),
                            ('old', None),
                            ('new',
                             OrderedDict([('item',
                                           [OrderedDict([('@id', '562622'),
                                                         ('@toString',
                                                          'Test')]),
                                            OrderedDict([('@id', '033362'),
                                                     ('@toString',
                                                      'Test2')])])]))]),
           OrderedDict([('@id', '33563'),
                        ('new',
                         OrderedDict([('item',
                                       OrderedDict([('@id', '44322'),
                                                    ('@toString',
                                                     'Test3')]))]))]),
           OrderedDict([('@id', '21733'),
                        ('old', None),
                        ('new',
                         OrderedDict([('@id', '12341212'),
                                      ('@toString', 'Test4')]))])])])

Source: http://docs.python-guide.org/en/latest/scenarios/xml/
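
One detail in the output above is that `item` is a list under the first property but a single mapping under the second, and `new` carries attributes directly under the third. A minimal sketch of one way to cope with that is shown below. `as_list` is a hypothetical helper, and `log` is a hand-written plain-dict stand-in for `doc['change']['log']` so the snippet runs without xmltodict installed.

```python
def as_list(value):
    """Return value as a list: None -> [], a list stays as is, anything else is wrapped."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hand-written stand-in for doc['change']['log'], copied from the printed output.
log = {
    "@id": "333",
    "@action": "create",
    "property": [
        {"@id": "52122", "old": None,
         "new": {"item": [{"@id": "562622", "@toString": "Test"},
                          {"@id": "033362", "@toString": "Test2"}]}},
        {"@id": "33563",
         "new": {"item": {"@id": "44322", "@toString": "Test3"}}},
        {"@id": "21733", "old": None,
         "new": {"@id": "12341212", "@toString": "Test4"}},
    ],
}

# Collect (property id, item id) pairs without caring whether
# a level was parsed as a single mapping or as a list.
item_ids = []
for prop in as_list(log.get("property")):
    new = prop.get("new") or {}
    for item in as_list(new.get("item")):
        item_ids.append((prop["@id"], item["@id"]))

print(item_ids)
```

As far as I know, `xmltodict.parse` also accepts a `force_list` argument (for example `force_list=('property', 'item')`) that makes the named tags always come back as lists, which may remove the need for a helper like this.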

【Discussion】:

Thanks for your help, but a list of dictionaries is not what I am looking for.

【Solution 2】:

Here is how you can proceed. I use two columns as an example; you can work out the rest yourself.

Step 1

Parse the XML with ElementTree

import xml.etree.ElementTree as ET
import datetime as date

def output_xml_parsing(xml):
    # Parse the file and read the attributes of the root <change> element
    root = ET.parse(xml).getroot()
    Change_User = root.attrib.get('user')
    timestamp = root.attrib.get('timestamp')
    return Change_User, timestamp

Step 2

Create a data frame and add the values to it. This example has only two columns, but you can extend it further.

def add_data_to_dataframe(xml):
    import pandas as pd
    # Values returned by the function above are stored in Change_User and timestamp
    Change_User, timestamp = output_xml_parsing(xml)

    # Dictionary that populates the data frame: each key is a column name
    # and each value is a one-element list holding that column's value
    data = {
        'Change_User': [Change_User],
        'timestamp': [timestamp]
    }
    # Build the data frame, indexing its single row by the current time
    report_dataframe = pd.DataFrame(data, index=[date.datetime.now()])
    return report_dataframe

Step 3

Call the function

ab=add_data_to_dataframe(r'D:\Users\pankaj-m\Desktop\Stack overflow questions\xml\data.xml')
print(ab)
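
As a hedged sketch of how the two-column example might be extended to every column in the question, the code below walks change, log, property, new, and item with nested loops and builds one dictionary per output row. Building whole rows, rather than appending to one list per column, keeps the values aligned and avoids the duplicates and ordering problems described in the question. `flatten_changes`, `XML`, and `COLUMNS` are names introduced here for illustration. The column mapping is an assumption: `New_Property_ID` is taken from the `id` attribute of `new`, and when `new` has no `item` children its own `toString` fills `To_String`.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch runs on its own.
XML = """<change user="123" timestamp="2017-09-04T13:58:46.190Z">
    <log id="333" action="create">
        <property id="52122">
            <old/>
            <new>
                <item id="562622" toString="Test"/>
                <item id="033362" toString="Test2"/>
            </new>
        </property>
        <property id="33563">
            <new>
                <item id="44322" toString="Test3"/>
            </new>
        </property>
        <property id="21733">
            <old/>
            <new id="12341212" toString="Test4"/>
        </property>
    </log>
</change>"""

COLUMNS = ["Change_User", "Timestamp", "Log_id", "Action",
           "property_ID", "New_Property_ID", "Item_ID", "To_String"]


def flatten_changes(root):
    """Return one dict per <item>, or per <new> that has no <item> children."""
    rows = []
    # iter() also yields root itself when root is the <change> element.
    for change in root.iter("change"):
        for log in change.findall("log"):
            for prop in log.findall("property"):
                for new in prop.findall("new"):
                    # Values shared by every row produced from this <new>
                    parent = {
                        "Change_User": change.get("user"),
                        "Timestamp": change.get("timestamp"),
                        "Log_id": log.get("id"),
                        "Action": log.get("action"),
                        "property_ID": prop.get("id"),
                        "New_Property_ID": new.get("id"),  # assumed mapping
                    }
                    items = new.findall("item")
                    if items:
                        for item in items:
                            rows.append(dict(parent,
                                             Item_ID=item.get("id"),
                                             To_String=item.get("toString")))
                    else:
                        # Assumption: attributes sit on <new> itself
                        rows.append(dict(parent,
                                         Item_ID=None,
                                         To_String=new.get("toString")))
    return rows


rows = flatten_changes(ET.fromstring(XML))
for row in rows:
    print([row[column] for column in COLUMNS])
```

With pandas installed, `pd.DataFrame(rows, columns=COLUMNS)` should then turn the list of dictionaries into a data frame with one row per item.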

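【Discussion】:

Hi, I know how to get the first elements. I am more interested in how to loop down through the levels to the item ids without getting duplicates.
Then you should revise your question; it is really confusing.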