【Question title】: Getting a memory error when parsing a large XML file in Python
【Posted】: 2013-06-19 13:47:39
【Question】:

My XML file looks like this:

<root>
<group from="1" to="100">
    <link target="1"/>
    ...
    <link target="100"/>
</group>
...
</root>

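I have 6000 <group> elements and 5M <link> elements. I want a dictionary keyed by the tuple (from, to), where each value is the list of target attributes of that group's <link> elements, but the following code fails with a memory error: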

from lxml import etree
from gzip import open as gopen

def extractTargets(fin):
    targets = dict()

    with gopen(fin) as xml:
        context = etree.iterparse(xml, tag="group")

        for event, elem in context:
            targets[(elem.get("from"), elem.get("to"))] = elem.xpath("link/@target")
            elem.clear()

            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

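【Comments】:

  • Perhaps you also need to extract the target attribute values from the xpath() result set? IIRC, the results still hold a reference to the tree through their parent pointer, so you want to get rid of any ElementTree objects as soon as possible.
  • I think using SAX is an option, since I don't need the whole tree in memory.

Tags: python xml xml-parsing lxml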


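【Solution 1】:

I ran into the same problem today. For me, it worked once I removed the "tag" argument and checked the tag inside the loop instead: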

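context = etree.iterparse(xml)

for event, elem in context:
    if elem.tag == "group":
        targets[(elem.get("from"), elem.get("to"))] = elem.xpath("link/@target")
        # Clear only once the whole group has been read: clear() also drops
        # attributes, so clearing each <link> earlier would lose its target.
        elem.clear()

        # Delete already-processed siblings so the root does not keep them alive.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
del context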
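
For readers who want to try the "process a group, then free it" pattern without lxml, here is a minimal sketch using the standard library's xml.etree.ElementTree instead. This is a swapped-in technique rather than the answer's own code: the function name extract_targets, the sample data, and the root.clear() cleanup are assumptions made for illustration, and memory behaviour on a real 5M-link file has not been measured here.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def extract_targets(source):
    """Map (from, to) -> list of link targets, freeing each group after use."""
    targets = {}
    # Request "start" events as well, so the first event yields the root element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "group":
            key = (elem.get("from"), elem.get("to"))
            # Copy plain strings out of the tree before discarding it.
            targets[key] = [link.get("target") for link in elem.findall("link")]
            # Drop the finished group (and its links) so the tree does not grow.
            root.clear()
    return targets


# Hypothetical sample data, gzip-compressed in memory to mimic the real input.
SAMPLE = b"""<root>
<group from="1" to="3"><link target="1"/><link target="2"/><link target="3"/></group>
<group from="4" to="5"><link target="4"/><link target="5"/></group>
</root>"""

with gzip.open(io.BytesIO(gzip.compress(SAMPLE))) as xml_file:
    result = extract_targets(xml_file)
```

The stored values are ordinary Python strings, so nothing in the dictionary keeps a reference back to the parsed elements.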

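【Comments】:

    【Solution 2】:

    Try one of the following. Each variant uses an event-driven parser, so no tree is built in memory: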

    lxml.etree

    import lxml.etree
    from gzip import open as gopen
    
    class GroupDictTarget(object):
        def __init__(self, d):
            self.d = d
        def start(self, tag, attrib):
            if tag == 'group':
                self.group = self.d[attrib['from'], attrib['to']] = []
            elif tag == 'link':
                self.group.append(attrib['target'])
        def close(self):
            pass
    
    def extractTargets(fin):
        with gopen(fin) as xml:
            targets = {}
            parser = lxml.etree.XMLParser(target=GroupDictTarget(targets))
            lxml.etree.parse(xml, parser)
            return targets
    

    xml.parsers.expat

    import xml.parsers.expat
    from gzip import open as gopen
    
    class GroupDictTarget(object):
        # Same as above
    
    def extractTargets(fin):
        targets = {}
        p = xml.parsers.expat.ParserCreate()
        p.StartElementHandler = GroupDictTarget(targets).start
        with gopen(fin) as f:
            p.ParseFile(f)
        return targets
    

    xml.sax

    import xml.sax
    from gzip import open as gopen
    
    class GroupDictTarget(object):
        # Same as above
    
    def extractTargets(fin):
        targets = {}
        handler = xml.sax.handler.ContentHandler()
        handler.startElement = GroupDictTarget(targets).start
        with gopen(fin) as f:
            xml.sax.parse(f, handler)
        return targets
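
To check the event-driven approach on a small input before running it on the full file, the following self-contained sketch uses only the standard library. It subclasses xml.sax.ContentHandler instead of assigning startElement on an instance as above. The class name GroupDictHandler, the function name extract_targets, and the in-memory sample are assumptions for illustration, not part of the original answer.

```python
import gzip
import io
import xml.sax


class GroupDictHandler(xml.sax.ContentHandler):
    """Collect link targets per (from, to) group without building a tree."""

    def __init__(self, d):
        super().__init__()
        self.d = d
        self.group = None  # list for the <group> currently being read

    def startElement(self, name, attrs):
        if name == "group":
            # Start a fresh list and register it under the (from, to) key.
            self.group = self.d[attrs["from"], attrs["to"]] = []
        elif name == "link" and self.group is not None:
            self.group.append(attrs["target"])


def extract_targets(fileobj):
    targets = {}
    xml.sax.parse(fileobj, GroupDictHandler(targets))
    return targets


# Hypothetical sample data, gzip-compressed in memory to mimic the real input.
SAMPLE = b"""<root>
<group from="1" to="2"><link target="1"/><link target="2"/></group>
<group from="3" to="3"><link target="3"/></group>
</root>"""

with gzip.open(io.BytesIO(gzip.compress(SAMPLE))) as f:
    sax_result = extract_targets(f)
```

Because the handler only ever sees one start tag at a time, memory use should be dominated by the resulting dictionary rather than by the XML itself.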
    

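    【Comments】:

    • Wow, simpler and more efficient! Exactly what I was looking for. For comparison: parsing with regular expressions takes 25 seconds, while your etree example takes only 7 seconds.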