Skipping XML elements using Regular Expressions in Python 3

【Question title】: Skipping XML elements using Regular Expressions in Python 3
【Posted】: 2018-11-10 05:53:06
【Question】:

I have an XML document from which I want to extract certain text contained in specific tags, for example:

<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>

<bdy>
some text
</bdy>

In this toy example, suppose I want to extract all the text contained in the <title> tags, using the following regular-expression code in Python 3:

# Python 3 code using re
import re

# Read the whole XML file into a string
with open("some_xml_file.xml", "r") as file:
    xml_doc = file.read()

# Find every <title>...</title> element (each match includes the tags)
title_text = re.findall(r'<title>.+</title>', xml_doc)

if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")

This gives me the text inside the XML tags together with the tags themselves. An example of a single output is:

<title>Four-minute warning</title>

My question is: how should I construct the pattern in re.findall() or re.search() so that the tags are skipped and I get only the text between them?

Thanks for your help!

【Comments】:

Don't use regex to parse XML.
I think I have to use regex to parse the XML file, because the file contains multiple root nodes/elements (document roots). As a result, ElementTree throws an error.
You can read the file as a string and wrap the contents in a root tag: valid_xml = f'<document>{xml_file_contents}</document>'. Then use the result as input to ElementTree.
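
The wrapping approach from the comment above could look roughly like the sketch below. The sample string is a hypothetical stand-in for the file contents, and the sketch assumes the file has no XML declaration or DOCTYPE at the top (those would no longer be at the start of the document once wrapped and would make the parser fail).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of a file with several top-level elements.
xml_file_contents = """
<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
</categories>
<bdy>
some text
</bdy>
"""

# Wrap everything in a single root element so the parser accepts it.
valid_xml = f"<document>{xml_file_contents}</document>"
root = ET.fromstring(valid_xml)

# .text holds only the text between the tags, never the tags themselves.
titles = [el.text for el in root.iter("title")]
categories = [el.text for el in root.iter("category")]

print(titles)      # ['Four-minute warning']
print(categories)  # ['Nuclear warfare', 'Cold War']
```

The name <document> is arbitrary; any element name not otherwise used in the file works as the wrapper.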
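@Arun, Johan is telling you not to use regex to parse XML because XML is not a regular language. You can treat your input as regular (and you will get a working regex) provided you never have to handle a <title> tag nested inside a pair of <title>...</title> tags, which XML allows. On the other hand, XML syntax is too complex for a simple regex to isolate every possible <title> tag (for example <title xmlns:blabla="...">).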
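
To illustrate the attribute case mentioned in the comment above, here is a small sketch with hypothetical inputs. The looser pattern is only one possible workaround and is still not a real XML parser: it would break, for instance, on an attribute value that contains a ">" character.

```python
import re

# Hypothetical inputs: the same element, without and with an attribute.
plain = '<title>Four-minute warning</title>'
with_attr = '<title xmlns:blabla="...">Four-minute warning</title>'

# The simple pattern only recognizes a bare <title> opening tag.
simple = r'<title>(.+)</title>'
plain_hits = re.findall(simple, plain)
attr_hits = re.findall(simple, with_attr)
print(plain_hits)  # ['Four-minute warning']
print(attr_hits)   # []

# A looser pattern that tolerates attributes in the opening tag.
looser = r'<title\b[^>]*>(.+?)</title>'
looser_hits = re.findall(looser, with_attr)
print(looser_hits)  # ['Four-minute warning']
```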

Tags: python regex

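【Solution 1】:

Just use a capture group in your regex. When the pattern contains a single group, re.findall() returns only the text captured by that group, so the tags are dropped for you. For example: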

import re

s = '<title>Four-minute warning</title>'

title_text = re.findall(r'<title>(.+)</title>', s)

print(title_text[0])
# OUTPUT
# Four-minute warning
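
One caveat with the pattern above: .+ is greedy, so if two title elements sit on the same line it captures everything between the first opening tag and the last closing tag. A minimal sketch, using a hypothetical input string, of the non-greedy variant and of the re.search() form the question also asks about:

```python
import re

# Hypothetical input with two title elements on one line.
s = '<title>Four-minute warning</title><title>Cold War</title>'

# Greedy: a single match that swallows the tags in the middle.
greedy = re.findall(r'<title>(.+)</title>', s)
print(greedy)  # ['Four-minute warning</title><title>Cold War']

# Non-greedy: one match per element.
lazy = re.findall(r'<title>(.+?)</title>', s)
print(lazy)  # ['Four-minute warning', 'Cold War']

# re.search() returns a match object for the first hit only;
# group(1) is the text captured by the parentheses.
m = re.search(r'<title>(.+?)</title>', s)
first = m.group(1) if m else None
print(first)  # Four-minute warning
```

By default the dot does not match newlines, so a title whose text spans several lines would also need the re.DOTALL flag.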

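【Comments】:

@mypetlion You are right to point out, for the benefit of the OP and future readers, that regex is usually not the best tool for parsing XML unless you have fairly complete knowledge of how the input XML is constructed. Otherwise, look into ElementTree or something similar.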