Parsing XML: find element sub-tree without a loop (answers)

【Question title】: Parsing XML: find element sub-tree without a loop
【Posted】: 2020-02-22 03:06:39
【Question】:

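I am parsing an XML payload with ElementTree. I cannot share the exact code or file because it contains sensitive information. I was able to extract the information I need by iterating over an element (as shown in the ElementTree documentation) and appending the output to lists. For example: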

list_col_name = []
list_col_value = []

for col in root.iter('my_table'):
    # get col name
    col_name = col.find('col_name').text
    list_col_name.append(col_name)
    # get col value
    col_value = col.find('col_value').text
    list_col_value.append(col_value)

I can then put them into a dictionary and carry on with the rest of the work:

dict_ = dict(zip(list_col_name, list_col_value))

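However, I need this to run as fast as possible, and I wonder whether there is a way to extract list_col_name in one go (i.e. using findall() or something similar). I am simply curious how to speed up the XML parsing, if that is possible. All answers/suggestions are appreciated. Thank you in advance.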

【Comments】:

Tags: python xml xml-parsing elementtree

【Answer 1】:

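My suggestion is to parse the source file "incrementally", based on the iterparse method. The reasons are that:

you do not actually need a fully parsed XML tree,
during incremental parsing you can discard elements that have already been processed, so the memory demand is also smaller.

Another hint is to use the lxml library instead of ElementTree. Although the iterparse method exists in both libraries, the lxml version has an additional tag parameter, so you can "limit" the loop to processing only the tags of interest.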
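If lxml is not an option, the same incremental idea can be sketched with the standard library. This is a minimal sketch, not the code that was benchmarked in this answer: the standard-library iterparse has no tag parameter, so the tag filter becomes an if inside the loop. SAMPLE and parse_incrementally are illustrative names, and the in-memory BytesIO source stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative sample data, shaped like the file used in this answer.
SAMPLE = """<root>
  <my_table id="t1"><col_name>N1</col_name><col_value>V1</col_value></my_table>
  <my_table id="t2"><col_name>N2</col_name><col_value>V2</col_value></my_table>
</root>"""


def parse_incrementally(source):
    """Collect col_name -> col_value pairs while the document is being read."""
    result = {}
    # Only "end" events are requested, so every element is complete when seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "my_table":
            result[elem.findtext("col_name")] = elem.findtext("col_value")
            # Drop the children and text of the processed element to keep
            # memory low; the emptied element itself stays attached to the root.
            elem.clear()
    return result


pairs = parse_incrementally(io.BytesIO(SAMPLE.encode("utf-8")))
print(pairs)
```

Because both children are read when my_table closes, their order inside the element does not matter.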

As the source file I used something like:

<root>
  <my_table id="t1">
    <col_name>N1</col_name>
    <col_value>V1</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
  <my_table id="t2">
    <col_name>N2</col_name>
    <col_value>V2</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
  <my_table id="t3">
    <col_name>N3</col_name>
    <col_value>V3</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
</root>

My actual source file:

contains 9 my_table elements (not 3),
has some_other_stuff repeated 8 times in each my_table, to simulate the other elements that each my_table contains.

I ran 3 tests using %timeit:

Your loop, preceded by parsing the source XML file:

from lxml import etree as et

def fn1():
    root = et.parse('Tables.xml')
    list_col_name = []
    list_col_value = []
    for col in root.iter('my_table'):
        col_name = col.find('col_name').text
        list_col_name.append(col_name)
        col_value = col.find('col_value').text
        list_col_value.append(col_value)
    return dict(zip(list_col_name, list_col_value))

The execution time was 1.74 ms.

My loop, based on iterparse, processing only the "required" elements:
```
def fn2():
    key = ''
    dict_ = {}
    context = et.iterparse('Tables.xml', tag=['my_table', 'col_name', 'col_value'])
    for action, elem in context:
        tag = elem.tag
        txt = elem.text
        if tag == 'col_name':
            key = txt
        elif tag == 'col_value':
            dict_[key] = txt
        elif tag == 'my_table':
            elem.clear()
            elem.getparent().remove(elem)
    return dict_
```
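I assume that in each my_table element col_name occurs before col_value, and that each my_table contains exactly one col_name child and one col_value child.

Note also that the function above clears each my_table element and removes it from the parsed XML tree (the getparent function is available only in the lxml version).

Another improvement is that I add each key / value pair "directly" to the dictionary that the function returns, so no zip is needed.

The execution time was 1.33 ms. Not dramatically faster, but the gain (roughly a quarter less time than fn1) is visible.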

You can also read all col_name and col_value elements by calling findall and then zip:

def fn3():
    root = et.parse('Tables.xml')
    list_col_name = []
    for elem in root.findall('.//col_name'):
        list_col_name.append(elem.text)
    list_col_value = []
    for elem in root.findall('.//col_value'):
        list_col_value.append(elem.text)
    return dict(zip(list_col_name, list_col_value))

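The execution time was 1.38 ms. This is also somewhat faster than your original solution, but not significantly different from my first solution (fn2).

Of course, the final result depends heavily on:

the size of the input file,
how much "other stuff" each my_table element contains.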
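The timings above were taken with IPython's %timeit. To check how the two factors just listed affect your own data outside IPython, a rough harness can be built on the standard-library timeit module. This is a sketch under assumptions: make_xml, parse_to_dict and best_ms are hypothetical helpers, the synthetic document only imitates the sample file, and the numbers it prints will differ from machine to machine.

```python
import timeit
import xml.etree.ElementTree as ET


def make_xml(n_tables, n_extra):
    """Build a synthetic document shaped like the sample file (hypothetical helper)."""
    extra = "<some_other_stuff>xx1</some_other_stuff>" * n_extra
    tables = "".join(
        f'<my_table id="t{i}"><col_name>N{i}</col_name>'
        f"<col_value>V{i}</col_value>{extra}</my_table>"
        for i in range(1, n_tables + 1)
    )
    return f"<root>{tables}</root>"


def parse_to_dict(xml_text):
    """The function under test: parse the text, then collect names and values."""
    root = ET.fromstring(xml_text)
    names = [e.text for e in root.findall("./my_table/col_name")]
    values = [e.text for e in root.findall("./my_table/col_value")]
    return dict(zip(names, values))


def best_ms(xml_text, number=200, repeat=3):
    """Best-of-`repeat` average time per call, in milliseconds."""
    runs = timeit.repeat(lambda: parse_to_dict(xml_text), number=number, repeat=repeat)
    return min(runs) / number * 1000


small = make_xml(n_tables=9, n_extra=8)    # same shape as the file described above
large = make_xml(n_tables=900, n_extra=8)  # 100 times more tables
print(f"9 tables:   {best_ms(small):.3f} ms per call")
print(f"900 tables: {best_ms(large, number=20):.3f} ms per call")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; varying n_tables and n_extra shows how the ranking of approaches shifts with input size.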

【Comments】:

【Answer 2】:

Consider a list comprehension with findall, which avoids the list initialization/append calls and the explicit for loop and may marginally improve performance:

# FINDALL LIST COMPREHENSION
list_col_name = [e.text for e in root.findall('./my_table/col_name')]
list_col_value = [e.text for e in root.findall('./my_table/col_value')]

dict(zip(list_col_name, list_col_value))
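The two comprehensions can also be folded into a single dict comprehension that reads both children of each my_table in one pass. The sketch below is a variation on the snippet above rather than something that was timed in this thread; SAMPLE and pairs are illustrative names, and it uses only the standard-library ElementTree.

```python
import xml.etree.ElementTree as ET

# Illustrative sample data with the same structure as in the question.
SAMPLE = """<root>
  <my_table id="t1"><col_name>N1</col_name><col_value>V1</col_value></my_table>
  <my_table id="t2"><col_name>N2</col_name><col_value>V2</col_value></my_table>
  <my_table id="t3"><col_name>N3</col_name><col_value>V3</col_value></my_table>
</root>"""

root = ET.fromstring(SAMPLE)

# findtext() returns the text of the first matching child, or None if absent.
pairs = {
    table.findtext("col_name"): table.findtext("col_value")
    for table in root.iter("my_table")
}
print(pairs)  # {'N1': 'V1', 'N2': 'V2', 'N3': 'V3'}
```

Since each name is paired with the value from the same my_table, a table that lacks a col_value yields None for that key instead of shifting every later pair, which is what zip over two separate lists would do.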

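Alternatively, with lxml (a third-party library), which fully supports XPath 1.0, consider xpath(), which can assign the parsed output directly to lists, again avoiding the initialization/append calls and the for loop: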

import lxml.etree as et
...

# XPATH LISTS
list_col_name = root.xpath('my_table/col_name/text()')
list_col_value = root.xpath('my_table/col_value/text()')

dict(zip(list_col_name, list_col_value))

【Comments】:

【Answer 3】:

I am not sure whether this is what you want.

from simplified_scrapy import SimplifiedDoc
html = '''
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
'''
doc = SimplifiedDoc(html)
ranks = doc.selects('country>(rank>text())')
print (ranks)
ranks = doc.selects('country>rank()')
print (ranks)
ranks = doc.selects('country>children()')
print (ranks)

Result:

['1', '4', '68']
[{'tag': 'rank', 'html': '1'}, {'tag': 'rank', 'html': '4'}, {'tag': 'rank', 'html': '68'}]
[[{'tag': 'rank', 'html': '1'}, {'tag': 'year', 'html': '2008'}, {'tag': 'gdppc', 'html': '141100'}, {'name': 'Austria', 'direction': 'E', 'tag': 'neighbor'}, {'name': 'Switzerland', 'direction': 'W', 'tag': 'neighbor'}], [{'tag': 'rank', 'html': '4'}, {'tag': 'year', 'html': '2011'}, {'tag': 'gdppc', 'html': '59900'}, {'name': 'Malaysia', 'direction': 'N', 'tag': 'neighbor'}], [{'tag': 'rank', 'html': '68'}, {'tag': 'year', 'html': '2011'}, {'tag': 'gdppc', 'html': '13600'}, {'name': 'Costa Rica', 'direction': 'W', 'tag': 'neighbor'}, {'name': 'Colombia', 'direction': 'E', 'tag': 'neighbor'}]]

【Comments】: