Parsing incrementally a large Wikipedia dump XML file using Python: answers

【Question title】: Parsing incrementally a large wikipedia dump XML file using python
【Posted】: 2019-03-13 08:06:33
【Question】:

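The goal is to read all … content from the Wikipedia dump (a 70 GB file). The file cannot be loaded into memory, so I tried to parse it incrementally and extract some values from it. However, the script I just wrote prints nothing and immediately consumes all of my memory.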

The code is as follows:

from lxml import etree

def fast_iter(context, func, *args, **kwargs):

    for event, elem in context:
        func(elem, *args, **kwargs)

        elem.clear()

        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    #print(elem)
    print (elem.xpath( './revision/text/text( )' ))

context = etree.iterparse( 'enwiki-latest-pages-articles-multistream.xml', tag='page' )
fast_iter(context,process_element)

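When this script is applied to a small XML file, it prints the values from the requested XPath.

However, when applied to the full file, nothing happens.

Here are some sample lines from the Wikipedia dump: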

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.33.0-wmf.19</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="118" case="first-letter">Draft</namespace>
      <namespace key="119" case="first-letter">Draft talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2300" case="first-letter">Gadget</namespace>
      <namespace key="2301" case="first-letter">Gadget talk</namespace>
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>854851586</id>
      <parentid>834079434</parentid>
      <timestamp>2018-08-14T06:47:24Z</timestamp>
      <contributor>
        <username>Godsy</username>
        <id>23257138</id>
      </contributor>
      <comment>remove from category for seeking instructions on rcats</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
      <sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>885648527</id>
      <parentid>885645378</parentid>
      <timestamp>2019-03-01T11:16:23Z</timestamp>
      <contributor>
        <username>Jarnsax</username>
        <id>33627956</id>
      </contributor>
      <comment>improve citation metadata</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">{{redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}
{{pp-move-indef}}
{{short description|Political philosophy that advocates self-governed societies}}
{{Use dmy dates|date=July 2018}}
{{use British English|date=January 2014}}
{{Anarchism sidebar}}
{{Basic forms of government}}
'''Anarchism''' is an [[anti-authoritarian]] [[political philosophy]]{{sfn|McLaughlin|2007|p=59}}{{sfn|Flint|2009|p=27}} that advocates [[Self-governance|self-governed]] societies based on voluntary, [[cooperative]] institutions and the rejection of coercive [[Hierarchy|hierarchies]] those societies view as unjust. These institutions are often described as [[Stateless society|stateless societies]],{{r|group=note|Note01}}{{sfn|Sheehan|2003|p=85}} although several authors have defined them more specifically as distinct institutions based on non-hierarchical or [[Free association (communism and anarchism)|free associations]].{{r|group=note|Note02}} Anarchism holds the [[State (polity)|state]] to be undesirable, unnecessary, and harmful.{{r|group=note|Note03}}&lt;ref name=definition /&gt; Any philosophy consistent with statelessness, that is, principled opposition to the State, is anarchist, thus anarchist schools of thought range from [[anarcho-communism]] to [[anarcho-capitalism]].{{sfn|Fiala|2018}}

While [[Anti-statism|opposition to the state]] is central,{{r|group=note|Note04}} many forms of anarchism specifically entail opposing authority or hierarchical organisation based on authority in the conduct of all human relations.{{r|group=note|Note05}} Anarchism is often considered a [[Far-left politics|far-left]] ideology,{{r|group=note|Note06}}{{sfn|Kahn|2000}}{{sfn|Moyihan|2007}} and much of [[anarchist economics]] and [[Anarchist law|anarchist legal philosophy]] reflect [[Libertarian socialism|anti-authoritarian interpretations]] of [[Anarcho-communism|communism]], [[Collectivist anarchism|collectivism]], [[Anarcho-syndicalism|syndicalism]], [[Mutualism (economic theory)|mutualism]], or [[participatory economics]].{{r|group=note|Note07}}

Anarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy.{{sfn|Marshall|2010|p=16}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.{{sfn|Sylvan|2007|p=262}} [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].{{sfn|McLean|McMillan|2003|loc= Anarchism}} Strains of anarchism have often been divided into the categories of [[Social anarchism|social]] and [[individualist anarchism]] or similar dual classifications.{{sfn|Ostergaard|p=14|loc=Anarchism}}{{sfn|Kropotkin|2002|p=5}}{{sfn|Fowler|1972}}
   </text>
   </revision>
   </page>
</mediawiki>

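Has anyone done this before? Any idea how to parse this huge dump efficiently? Is there an existing package or library for it? I don't want to reinvent the wheel.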

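【Comments on the question】:

Why is elem.clear() there? It removes all child elements, so subsequent attempts to find elements below elem will return nothing.
What output do you get? Absolutely nothing? How long might your code take to read 70 GB, and did you wait long enough?
@barny I get no output, and I waited about 5 minutes. But I would expect that, since we are iterating over each "page" element, output should be produced immediately. Or is my assumption wrong?
You could try a SAX parser, which is designed for streaming.
"What is that elem.clear() doing?" It removes all child Elements from the Element. It is used so that the XML tree is not built up while parsing.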
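
The point raised in the comments about elem.clear() can be checked with a minimal sketch. It uses the standard library's xml.etree.ElementTree and a made-up one-page sample (both are assumptions for illustration, not part of the question's script). It only shows that values must be read before the element is cleared.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for one <page> element (hypothetical sample data).
page = ET.fromstring(
    "<page><title>Anarchism</title>"
    "<revision><text>some wikitext</text></revision></page>"
)

# Read what is needed first ...
text_before = page.findtext("./revision/text")
children_before = len(page)

# ... then clear the element to release its children, text and attributes.
page.clear()
text_after = page.findtext("./revision/text")
children_after = len(page)

print(text_before, children_before)  # some wikitext 2
print(text_after, children_after)    # None 0
```

In the question's script the clear() call comes after func(elem, ...), so the order itself is fine there; clearing only becomes a problem if it happens before the lookup.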

Tags: python xml xml-namespaces wikipedia iterparse

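【Solution 1】:

Question: Parsing incrementally a large Wikipedia dump XML file
When this (the question's) script is applied to a small XML file, it prints the values from the requested XPath.
However, when applied to the full file, nothing happens.

I wonder what you could get from the small file, since you don't use the namespace parameter.
The Wikipedia XML file uses the following default namespace: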

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
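
A small sketch of why the default namespace matters. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; that substitution, the `SAMPLE` data and the helper name `end_tags` are assumptions for illustration. Both libraries report element names in the `{namespace-uri}localname` form, so a filter on plain `page` would likely match nothing in the dump. That would fit the symptoms in the question: no output, while the tree keeps growing in memory because nothing is ever cleared.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.mediawiki.org/xml/export-0.10/"
QUALIFIED_PAGE = "{" + NS + "}page"

# Hypothetical miniature dump that declares the same default namespace.
SAMPLE = (
    '<mediawiki xmlns="' + NS + '">'
    "<page><title>AccessibleComputing</title></page>"
    "</mediawiki>"
).encode("utf-8")


def end_tags(xml_bytes):
    """Return the tag of every element exactly as the parser reports it."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("end",))
    return [elem.tag for _, elem in events]


tags = end_tags(SAMPLE)
for tag in tags:
    print(tag)

# The bare name never appears; only the namespace-qualified one does.
print("page" in tags)          # False
print(QUALIFIED_PAGE in tags)  # True
```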

The following example uses lxml:

import io

from lxml import etree

class Wikipedia:
    def __init__(self, fh, tag):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle from the XML File to parse
        :param tag: The tag to process
        """
        # Prepend the default Namespace {*} to get anything.
        self.context = etree.iterparse(fh, events=("end",), tag=['{*}' + tag])

    def _parse(self):
        """
        Parse the XML File for all '<tag>...</tag>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current 'Event, Element Tree'
        """
        for event, elem in self.context:
            yield event, elem

            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def __iter__(self):
        """
        Iterate all '<tag>...</tag>' Element Trees yielded from self._parse()

        :return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
        """
        for event, elem in self._parse():
            entity = {}

            # Assign the 'elem.namespace' to the 'xpath'
            entity['revision'] = elem.xpath('./xmlns:revision/xmlns:text/text( )', 
                                   namespaces={'xmlns':etree.QName(elem).namespace})

            yield entity


if __name__ == "__main__":
    XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/
http://www.mediawiki.org/xml/export-0.10.xsd"
version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    ... (omitted for brevity)"""

    # To parse a real file instead, open it in binary mode:
    # with open('.\\FILE.XML', 'rb') as in_xml:
    with io.BytesIO(XML) as in_xml:
        for record in Wikipedia(in_xml, tag='page'):
            print("record:{}".format(record))

Output:

record:{'revision': ['#REDIRECT [[Computer accessi... (omitted for brevity)
record:{'revision': ["{{redirect2|Anarchist|Anarch... (omitted for brevity)

Tested with Python: 3.5 - lxml.etree: 3.7.1
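
For readers without lxml, the same idea (handle each page on its 'end' event, then release it) can be sketched with the standard library. This is not the answer's code: the function name `iter_pages`, the helper `namespace_prefix` and the two-page `SAMPLE` are hypothetical, and the sketch assumes every page carries the same namespace as the root element.

```python
import io
import xml.etree.ElementTree as ET


def namespace_prefix(tag):
    """Return '{uri}' for a qualified tag, or '' when there is no namespace."""
    return tag[: tag.index("}") + 1] if tag.startswith("{") else ""


def iter_pages(source):
    """Yield (title, text) for each <page>, then drop it from the tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    ns = namespace_prefix(root.tag)
    for event, elem in context:
        if event == "end" and elem.tag == ns + "page":
            title = elem.findtext(ns + "title")
            text = elem.findtext(ns + "revision/" + ns + "text")
            yield title, text
            # Runs when the next page is requested: forget processed pages.
            root.clear()


# Hypothetical miniature dump used only to exercise the sketch.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>first</text></revision></page>
  <page><title>B</title><revision><text>second</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)  # [('A', 'first'), ('B', 'second')]
```

For a real dump, `source` would be a file opened in binary mode rather than a BytesIO object.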

【Comments】:

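【Solution 2】:

Use SAX. See the example below (https://www.tutorialspoint.com/python3/python_xml_processing.htm).

Simple API for XML (SAX): here, you register callbacks for the events you are interested in and then let the parser proceed through the document. This is useful when your document is large or you have memory limitations. The file is parsed as it is read from disk, and the entire file is never stored in memory.

SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX generally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.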

import xml.sax

class MovieHandler( xml.sax.ContentHandler ):
   def __init__(self):
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Call when an element starts
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print ("*****Movie*****")
         title = attributes["title"]
         print ("Title:", title)

   # Call when an elements ends
   def endElement(self, tag):
      if self.CurrentData == "type":
         print ("Type:", self.type)
      elif self.CurrentData == "format":
         print ("Format:", self.format)
      elif self.CurrentData == "year":
         print ("Year:", self.year)
      elif self.CurrentData == "rating":
         print ("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print ("Stars:", self.stars)
      elif self.CurrentData == "description":
         print ("Description:", self.description)
      self.CurrentData = ""

   # Call when a character is read
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

if ( __name__ == "__main__"):

   # create an XMLReader
   parser = xml.sax.make_parser()
   # turn off namespaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # override the default ContentHandler
   Handler = MovieHandler()
   parser.setContentHandler( Handler )

   parser.parse("c:\\temp\\movies.xml")

movies.xml

<collection shelf = "New Arrivals">
<movie title = "Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title = "Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A scientific fiction</description>
</movie>
   <movie title = "Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title = "Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>
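
The movie example above is generic. A sketch of how the same SAX approach might be applied to the page/title/text structure shown in the question follows. The class name `PageTextHandler`, the `on_page` callback and the `SAMPLE` data are hypothetical. One deliberate difference from the tutorial code: characters() may be called several times for a single text node, so the content is appended rather than assigned.

```python
import xml.sax


class PageTextHandler(xml.sax.ContentHandler):
    """Collect the title and text of each <page> and hand them to a callback."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page  # called once per page with (title, text)
        self.current = None     # name of the field being read, if any
        self.buffers = {}       # chunks collected for the current page

    def startElement(self, name, attrs):
        if name == "page":
            self.buffers = {"title": [], "text": []}
        elif name in self.buffers:
            self.current = name

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self.current is not None:
            self.buffers[self.current].append(content)

    def endElement(self, name):
        if name == self.current:
            self.current = None
        elif name == "page":
            self.on_page("".join(self.buffers["title"]),
                         "".join(self.buffers["text"]))
            self.buffers = {}


# Hypothetical miniature dump used only to exercise the sketch.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>first</text></revision></page>
  <page><title>B</title><revision><text>second</text></revision></page>
</mediawiki>"""

pages = []
handler = PageTextHandler(lambda title, text: pages.append((title, text)))
xml.sax.parseString(SAMPLE, handler)
print(pages)  # [('A', 'first'), ('B', 'second')]
```

For the real dump, xml.sax.parse(filename, handler) would take the place of parseString. Namespace processing is left at its default (off), so element names arrive without the namespace URI.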

【Comments】: