Need to parse an XML file and construct an Excel workbook

【Question title】: Need to parse an XML file and construct an Excel workbook
【Posted】: 2020-08-10 10:54:37
【Question description】:

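I need to parse an XML file and then prepare a workbook with a specific format. I use a YAML file to define the columns to create in the Excel sheet. The YAML file looks like this: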

Sheet1:
    1:
        - Country Name: ./country@name #this should be a unique value
        - Description: ./country@descr
        - Neighbor: ./country/neighbor@name
    2:
        - Country Name: ./country@name #this should be a unique value
        - Year: ./country@year
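
The `path@attribute` values in this mapping are not standard XPath, so they presumably have to be split before being handed to ElementTree. The following is a minimal sketch under that assumption: `split_spec` and `table_1` are hypothetical names, and `table_1` is a hand-written stand-in for what a YAML loader would likely return for table 1 of Sheet1.

```python
def split_spec(spec):
    """Split a 'path@attr' string into (element_path, attribute_name)."""
    path, sep, attr = spec.rpartition("@")
    if not sep:
        # No '@' in the spec: treat it as a bare element path.
        return spec, None
    return path, attr


# Hand-written stand-in for the loaded YAML (a list of one-key mappings).
table_1 = [
    {"Country Name": "./country@name"},
    {"Description": "./country@descr"},
    {"Neighbor": "./country/neighbor@name"},
]

# Flatten to {column name: (element path, attribute name)}.
columns = {
    name: split_spec(spec)
    for entry in table_1
    for name, spec in entry.items()
}
print(columns)
```

The element path can then go to `findall()` and the attribute name to `.get()`, which matches how the question describes using the YAML values.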

XML data:

<data>
    <country name="Liechtenstein" descr="TT">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama" desc="RR">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

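To the point: I am trying to parse the XML data and create a DataFrame from it. This DataFrame will then be written to an Excel workbook. I use the values from the YAML file as inputs to ElementTree's findall('./country') and .get('name') methods. Everything works when every country in the sample has the same number of neighbors, but that is not the case here. I currently populate each column's data as a separate list, which I know is wrong. I would like to know whether there is a better way to insert NaN/None, as shown below.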

This is what I get:

Sheet1
    Country Name Description     Neighbor
0  Liechtenstein          TT      Austria
1      Singapore        None  Switzerland
2         Panama        None     Malaysia
3            NaN         NaN   Costa Rica
4            NaN         NaN     Colombia
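
The asker's code is not shown, so the following is only a plausible reconstruction of how such a table arises, assuming each column is collected into its own list and the shorter lists are padded. It uses only the standard library; `zip_longest` stands in for the padding that a DataFrame built from unequal-length columns would need.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest

XML = """
<data>
    <country name="Liechtenstein" descr="TT">
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama" desc="RR">
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
"""

root = ET.fromstring(XML)

# Each column is gathered independently, so the link between a neighbor
# and the country it belongs to is lost.
names = [c.get("name") for c in root.findall("./country")]
descrs = [c.get("descr") for c in root.findall("./country")]
neighbors = [n.get("name") for n in root.findall("./country/neighbor")]

# Pad the shorter columns with None and read the result row by row.
rows = list(zip_longest(names, descrs, neighbors))
for row in rows:
    print(row)
```

There are three countries but five neighbors, so the country columns run out after three rows and every neighbor after the first ends up next to the wrong country or next to padding.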

This is what I need:

Sheet1
    Country Name Description     Neighbor
0  Liechtenstein          TT      Austria
1  Liechtenstein          TT  Switzerland
2      Singapore        None     Malaysia
3         Panama        None   Costa Rica
4         Panama        None     Colombia

Edit: The YAML file can contain more column names, and these need to be fed into the Excel sheet dynamically.

【Question discussion】:

Tags: python xml dataframe

【Solution 1】:

Do the following to get your data into a DataFrame; you can take it from there:

import pandas as pd
import lxml.html as lh

countries = """[your html above - fixed (a closing " is missing in the first descr value)]"""

doc = lh.fromstring(countries)
rows = []
cols = ["Country Name", "Description", "Neighbor"]
for country in doc.xpath('//country'):
    # Country-level values are the same for every neighbor of this country.
    name = country.xpath('@name')[0]
    descr = country.xpath('@descr')
    # Emit one row per neighbor so the country values repeat on each row.
    for neighbor in country.xpath('.//neighbor/@name'):
        # "None" is a placeholder string; use None instead for a real missing value.
        rows.append([name, descr[0] if descr else "None", neighbor])
df = pd.DataFrame(rows, columns=cols)
print(df)

Output:

    Country Name Description     Neighbor
0  Liechtenstein          TT      Austria
1  Liechtenstein          TT  Switzerland
2      Singapore        None     Malaysia
3         Panama        None   Costa Rica
4         Panama        None     Colombia
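
The question's edit asks for the columns to come from the YAML file rather than being hard-coded. The same one-row-per-neighbor idea can be sketched with the standard library's ElementTree, driven by a column mapping. This is a hedged illustration, not the answerer's method: `build_rows` is a hypothetical helper, `columns` is assumed to hold the YAML values already split into (element path, attribute) pairs, every deeper path is assumed to start with the shortest one, and only one nested path is handled properly.

```python
import xml.etree.ElementTree as ET

XML = """
<data>
    <country name="Liechtenstein" descr="TT">
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama" desc="RR">
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
"""

# Assumed result of splitting the YAML values "path@attr" for table 1.
columns = {
    "Country Name": ("./country", "name"),
    "Description": ("./country", "descr"),
    "Neighbor": ("./country/neighbor", "name"),
}


def build_rows(root, columns):
    """Return one dict per nested element, repeating the record-level values."""
    # The shortest path is taken to be the record element (here ./country).
    record_path = min((path for path, _ in columns.values()), key=len)
    rows = []
    for record in root.findall(record_path):
        base = {name: None for name in columns}  # every column present, default None
        nested = {}  # relative path -> [(column name, attribute), ...]
        for name, (path, attr) in columns.items():
            if path == record_path:
                base[name] = record.get(attr)
            else:
                rel = "." + path[len(record_path):]  # e.g. "./neighbor"
                nested.setdefault(rel, []).append((name, attr))
        if not nested:
            rows.append(base)
            continue
        for rel, cols in nested.items():
            # Keep the record even when it has no nested elements at all.
            for child in record.findall(rel) or [None]:
                row = dict(base)
                for name, attr in cols:
                    row[name] = child.get(attr) if child is not None else None
                rows.append(row)
    return rows


rows = build_rows(ET.fromstring(XML), columns)
for row in rows:
    print(row)
```

Because each row is a dict keyed by column name, passing `rows` to `pd.DataFrame(rows)` should yield a frame shaped like the output above, with real `None` values where an attribute is missing, and `DataFrame.to_excel` can then write it to a sheet. Note that `.get()` only reads attributes, so a value such as `./country@year` from table 2, where `year` is a child element in the sample XML, would come back as `None` with this sketch.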

【Discussion】: