XML document can't be parsed due to strange characters

【Question title】: XML document can't be parsed due to strange characters
【Posted】: 2020-02-20 00:42:55
【Question description】:

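I'm using Python 3 to retrieve data from an API, and I'm having trouble parsing some XML documents from the strings it returns.

I've identified the specific string that causes the problem: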

from xml.etree import ElementTree

bad_string = '<tag>Sample &#x91;cp 99-3a&#x92</tag>'
ElementTree.fromstring(bad_string)

This is the error that stops the script:

ParseError: not well-formed (invalid token): line 1, column 31

I tried the following workaround, with the same result as before:

ElementTree.fromstring('<tag>Sample &#x91;cp 99-3a&#x92</tag>'.encode('ascii', 'ignore'))

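How can I clean this string without writing a specific regular expression for every similar string?

Edit: now that @b_c and @mzjn have explained that my problem is unescaped characters, I've found a possible solution (Escape unescaped characters in XML with Python), using XMLParser with recover=True (an lxml option, so etree here is lxml.etree):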

ElementTree.fromstring('<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>', parser = etree.XMLParser(recover = True))

【Question comments】:

&#x92 is the problem. With a semicolon at the end (&#x92;), it would be a proper numeric character reference. See en.wikipedia.org/wiki/….
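
To confirm which character the parser trips over, the position carried by ParseError can be mapped back onto the string. This is only a minimal sketch: locate_parse_error is a hypothetical helper name (not part of ElementTree), and it assumes single-line, ASCII-only input so that the reported column can be used directly as a string index.

```python
from xml.etree import ElementTree


def locate_parse_error(xml_text):
    """Return (line, column, offending_char), or None if xml_text parses.

    Hypothetical helper for illustration only. Assumes xml_text is a
    single ASCII line, so the column doubles as a string index.
    """
    try:
        ElementTree.fromstring(xml_text)
    except ElementTree.ParseError as err:
        line, column = err.position  # (line, column) reported by the parser
        # Slice instead of indexing so an end-of-input error cannot raise.
        return line, column, xml_text[column:column + 1]
    return None


bad_string = '<tag>Sample &#x91;cp 99-3a&#x92</tag>'
print(locate_parse_error(bad_string))
```

For the string in the question, the offending character is the < that follows &#x92, which is where the parser expected the closing semicolon.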

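Tags: python xml-parsing

【Solution 1】:

Your string contains HTML entities (whether the document is XML or HTML), and they need to be unescaped. &#x91; and &#x92 correspond to ‘ and ’, respectively.

If you use html.unescape, you'll see the cleaned-up text: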

>>> import html
>>> html.unescape('<tag>Sample &#x91;cp 99-3a&#x92</tag>')
'<tag>Sample ‘cp 99-3a’</tag>'

Edit: @mzjn pointed out that you can also fix the string by adding the missing semicolon to the second entity:

>>> import xml.etree.ElementTree as ET
>>> tag = ET.fromstring('<tag>Sample &#x91;cp 99-3a&#x92;</tag>')
>>> tag.text
'Sample \x91cp 99-3a\x92'

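However, you'll see that the \x91 and \x92 characters are still there (and this approach requires that you control the contents of the string). These are the MS CP1252 encodings of the left and right single quotation marks. Using the html.unescape approach above will still give you the cleaned-up text.

Follow-up to the comments

In your comment, you added the extra wrinkle of strings that contain other valid XML escape sequences (such as &amp;), which html.unescape will happily unescape as well. Unfortunately, as you saw, that brings you right back to square one, because you now have an & that should be escaped but isn't (ElementTree would have unescaped it for you).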
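
If the semicolon can be fixed at the source, the leftover \x91 and \x92 characters can also be remapped after parsing. The following is a rough sketch rather than a definitive fix: it assumes every character in the range U+0080 to U+009F in the parsed text was really meant as a CP1252 byte, and C1_TO_CP1252 is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET


def build_c1_to_cp1252_table():
    """Map code points 0x80-0x9F to the characters CP1252 assigns them."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes in this range are undefined in CP1252;
            # leave those characters unchanged.
            pass
    return table


# Hypothetical name; the table is in the format str.translate expects.
C1_TO_CP1252 = build_c1_to_cp1252_table()

node = ET.fromstring("<tag>Sample &#x91;cp 99-3a&#x92;</tag>")
fixed_text = node.text.translate(C1_TO_CP1252)
print(fixed_text)
```

Here fixed_text holds the same curly quotes that html.unescape produces, but the remapping happens on the parsed text, so the XML markup itself is never touched.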

>>> import html
>>> import xml.etree.ElementTree as ET
>>> cleaned = html.unescape('<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>')
>>> print(cleaned)
<tag>&Sample ‘cp 99-3a’</tag>
>>> ET.fromstring(cleaned)
Traceback (most recent call last):
  ...
ParseError: not well-formed (invalid token): line 1, column 12

You could also try soupparser from lxml.html, which copes better with problematic HTML/XML:

>>> from lxml.html import soupparser
>>> soupparser.fromstring('<tag>&amp;Sample &#x91;cp 99-3 a&#x92;</tag>').text_content()
'&Sample ‘cp 99-3 a’'

Or, depending on your needs, you may be better off doing a string/regex replacement before parsing to get rid of the annoying cp1252 characters:

>>> import re
>>> # Matches "&#x91" or "&#x92", with or without trailing semicolon
>>> node = ET.fromstring(re.sub(r'&#x9[1-2];?', "'", '<tag>&amp;Sample &#x91;cp 99-3 a&#x92;</tag>'))
>>> node.text
"&Sample 'cp 99-3 a'"
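
A more general variant of the same idea is to resolve only the numeric character references and leave named escapes such as &amp; for ElementTree to handle. This is a sketch under stated assumptions, not a definitive implementation: resolve_numeric_refs and NUMERIC_REF are hypothetical names, and the pattern inherits html.unescape's lenient handling of missing semicolons, so a hex reference with no semicolon that is immediately followed by hex-looking text (for example &#x92abc) would be read as one longer reference.

```python
import html
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Decimal or hexadecimal character reference, semicolon optional.
NUMERIC_REF = re.compile(r'&#(?:[xX][0-9a-fA-F]+|[0-9]+);?')


def resolve_numeric_refs(xml_text):
    """Replace numeric character references with the characters they name.

    Hypothetical helper. Each result is re-escaped so that a reference
    such as &#38; becomes &amp; rather than a bare ampersand.
    """
    def _replace(match):
        return escape(html.unescape(match.group(0)))

    return NUMERIC_REF.sub(_replace, xml_text)


raw = '<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>'
node = ET.fromstring(resolve_numeric_refs(raw))
print(node.text)
```

Because &amp; is never touched before parsing, the string stays well-formed, and node.text contains the ampersand plus the curly quotes.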

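【Discussion】:

The semicolon doesn't matter, at least not for html.unescape, apparently. And yes, they are HTML entities.
Saying "it's HTML with HTML entities" is misleading. If there were a semicolon after &#x92, the bad_string in the question would be well-formed XML.
Thanks a lot @b_c and @mzjn, both solutions work, but now I've run into another problem with &amp;. For example, when I run ElementTree.fromstring(html.unescape('<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>')), I get the same problem as before.
Updated with some additional options based on this :)
Great! This is the solution I was looking for, thanks!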