Retrieving All Article Titles from an XML Wiki Dump - Python Answers

【Question title】: Retrieving All Articles Titles from an XML Wiki Dump - Python
【Posted】: 2026-02-06 03:25:01
【Question】:

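I have a Wikipedia XML dump that I created by exporting all the pages of a category. You can see the exact structure of this XML file by generating one for yourself at https://en.wikipedia.org/wiki/Special:Export. Now I would like to list the title of every article using Python. I have tried: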

import xml.etree.ElementTree as ET

tree = ET.parse('./comp_sci_wiki.xml')
root = tree.getroot()

for element in root:
    for sub in element:
        print(sub.find("title"))

Nothing is printed. This seems like it should be a relatively simple task. Any help you can provide would be greatly appreciated. Thanks!

【Question comments】:

Tags: python xml xml-parsing

【Solution 1】:

If you look at the beginning of your export file, you will see that the document declares a default XML namespace:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLo

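This means that there is no un-namespaced "title" element in the document, which is one reason your sub.find("title") statement fails. You can see this if you print your root element: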

>>> print(root)
<Element '{http://www.mediawiki.org/xml/export-0.10/}mediawiki' at 0x7f2a45df6c10>
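The effect can be reproduced without a real dump. The following is a minimal sketch, assuming a heavily trimmed, made-up stand-in for an export file (the SAMPLE string is hypothetical; only the namespace URI comes from the file header shown above). It shows that the plain name finds nothing, while the fully qualified name does:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a Special:Export file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Outline of computer science</title>
  </page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)
page = root[0]  # the only child of the root is the <page> element

# ElementTree stores tags as {namespace-uri}localname.
print(root.tag)

# There is no un-namespaced <title>, so this lookup returns None.
plain = page.find("title")
print(plain)

# The same lookup with the namespace URI in braces succeeds.
qualified = page.find("{http://www.mediawiki.org/xml/export-0.10/}title")
print(qualified.text)
```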

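Note that it does not say <Element 'mediawiki'>. The identifier includes the full namespace. This document describes in detail how to work with namespaces in XML documents, but the tl;dr version is that you need: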

>>> from xml.etree import ElementTree as ET
>>> tree=ET.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> root = tree.getroot()
>>> ns = 'http://www.mediawiki.org/xml/export-0.10/'
>>> for page in root.findall('{%s}page' % ns):
...   print (page.find('{%s}title' % ns).text)
... 
Category:Wikipedia books on computer science
Computer science in sport
Outline of computer science
Category:Unsolved problems in computer science
Category:Philosophy of computer science
[...etc...]
>>>
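As a variation on the session above, find() and findall() in the standard library also accept a prefix-to-URI mapping, which avoids the string formatting. This is a sketch under assumptions: the prefix mw, the helper page_titles, and the SAMPLE string are made up for illustration, and the version suffix in the namespace URI should be checked against the xmlns attribute at the top of your own file.

```python
import xml.etree.ElementTree as ET

# The prefix "mw" is arbitrary; only the URI has to match the document.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

# Hypothetical, trimmed stand-in for a Special:Export file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Computer science in sport</title>
  </page>
  <page>
    <title>Outline of computer science</title>
  </page>
</mediawiki>"""


def page_titles(root):
    """Return the title text of every <page> directly under the root."""
    return [
        page.find("mw:title", NS).text
        for page in root.findall("mw:page", NS)
    ]


titles = page_titles(ET.fromstring(SAMPLE))
for title in titles:
    print(title)
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE).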

Your life may be easier if you install the lxml module, which includes full XPath support and lets you do something like this:

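>>> from lxml import etree
>>> # The tree must be parsed with lxml for .xpath() to be available
>>> tree = etree.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> nsmap = {'x': 'http://www.mediawiki.org/xml/export-0.10/'}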
>>> for title in tree.xpath('//x:title', namespaces=nsmap):
...   print (title.text)
... 
Category:Wikipedia books on computer science
Computer science in sport
Outline of computer science
Category:Unsolved problems in computer science
Category:Philosophy of computer science
Category:Computer science organizations
[...etc...]
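If installing lxml is not an option, the standard library gets reasonably close for this particular query. ElementTree supports only a limited XPath subset, but that subset includes ".//" for a descendant search. The sketch below assumes a made-up sample document and an arbitrary prefix mw; the "{*}" wildcard form requires Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

# Hypothetical, trimmed stand-in for a Special:Export file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <sitename>Wikipedia</sitename>
  </siteinfo>
  <page>
    <title>Computer science in sport</title>
  </page>
  <page>
    <title>Outline of computer science</title>
  </page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)

# ".//" searches all descendants, much like "//x:title" in the lxml session.
by_prefix = [el.text for el in root.findall(".//mw:title", NS)]

# "{*}title" matches <title> in any namespace (Python 3.8+).
by_wildcard = [el.text for el in root.findall(".//{*}title")]

print(by_prefix)
print(by_wildcard)
```

The wildcard form is convenient when the export version in the URI is unknown, at the cost of also matching title elements from any other namespace.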

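In any case, read through the documentation on namespace support; hopefully that, together with these examples, will point you in the right direction. The takeaway should be that XML namespaces matter, and that a title in one namespace is not the same as a title in another.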
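To make the last point concrete, here is a small sketch with two title elements that share a local name but live in different namespaces. The document and both example.com URIs are invented for illustration and have nothing to do with the MediaWiki export format.

```python
import xml.etree.ElementTree as ET

# Made-up document: two <title> elements in two different namespaces.
SAMPLE = """<root xmlns:a="http://example.com/ns/a"
      xmlns:b="http://example.com/ns/b">
  <a:title>Title in namespace a</a:title>
  <b:title>Title in namespace b</b:title>
</root>"""

NS = {"a": "http://example.com/ns/a", "b": "http://example.com/ns/b"}

root = ET.fromstring(SAMPLE)

# Each query only sees the <title> from its own namespace.
a_titles = [el.text for el in root.findall("a:title", NS)]
b_titles = [el.text for el in root.findall("b:title", NS)]

# The stored tags differ even though the local name is the same.
tags = [child.tag for child in root]

print(a_titles)
print(b_titles)
print(tags)
```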

【Comments】: