Replace HTML entities with the corresponding UTF-8 characters in Python 2.6

【Question title】: Replace html entities with the corresponding utf-8 characters in Python 2.6
【Posted】: 2010-10-18 08:02:13
【Question】:

I have HTML text like this:

&lt;xml ... &gt;

I want to convert it into something readable:

<xml ...>

Is there a simple (and fast) way to do this in Python?

【Comments】:

I think this question is a duplicate of: stackoverflow.com/questions/57708/…
Possible duplicate of Decode HTML entities in Python string?
Best approach: stackoverflow.com/questions/2360598/…

Tags: python html-entities python-2.6

【Solution 1】:

Python >= 3.4

Official documentation for HTMLParser: Python 3

>>> from html import unescape
>>> unescape('&copy; &euro;')
'© €'

Python 3 (older versions)

Official documentation for HTMLParser: Python 3

>>> from html.parser import HTMLParser
>>> pars = HTMLParser()
>>> pars.unescape('&copy; &euro;')
'© €'

Note: this method is deprecated in favor of html.unescape(), and HTMLParser.unescape() was removed entirely in Python 3.9.

Python 2.7

Official documentation for HTMLParser: Python 2.7

>>> import HTMLParser
>>> pars = HTMLParser.HTMLParser()
>>> pars.unescape('&copy; &euro;')
u'\xa9 \u20ac'
>>> print _
© €
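
The three variants above differ only in where unescape lives, so they can be folded into one import fallback. The following is a minimal sketch under that assumption, not part of the original answer; decode_entities is a hypothetical wrapper name chosen for illustration.

```python
# Pick whichever unescape implementation the running interpreter provides.
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3: method on an HTMLParser instance
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6/2.7: same method, different module name
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Return text with HTML entities replaced by the characters they name.

    decode_entities is a hypothetical name used only in this sketch.
    """
    return unescape(text)


print(decode_entities('&lt;xml ... &gt;'))  # <xml ... >
```

Trying the newest location first means the deprecated method is only used on interpreters that have nothing better.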

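【Comments】:

unescape is just an internal function of HTMLParser (and it is not documented in your link). Still, I can use that implementation. Thanks a lot.
@brtzsnr: Yes, it is undocumented. I would not consider it internal, though; after all, the name is unescape, not _unescape or __unescape.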

【Solution 2】:

The modern Python 3 approach:

>>> import html
>>> html.unescape('&copy; &euro;')
'© €'

https://docs.python.org/3/library/html.html
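
Applied to the markup from the question, the same call looks roughly like this. This is a small sketch assuming Python 3.4 or later; the variable names escaped and readable are illustrative only.

```python
import html

# The markup from the question
escaped = '&lt;xml ... &gt;'
readable = html.unescape(escaped)
print(readable)  # <xml ... >

# Numeric references (decimal and hexadecimal) are decoded as well
print(html.unescape('&#169; &#x20ac;'))  # © €

# html.escape() goes the other way; quote=False leaves quotation marks alone
assert html.escape(readable, quote=False) == escaped
```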

【Comments】:

【Solution 3】:

The post Fred pointed to links to a function (here) that does this. It is copied here to make things easier.
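
The copied function is not reproduced in this text. As a stand-in, here is a minimal sketch of one regex-based way to do the same job on Python 2.6 without HTMLParser; it is an assumption about the general approach, not the code the answer copied. The name unescape_entities is hypothetical, while name2codepoint is the standard-library entity table (htmlentitydefs on Python 2, html.entities on Python 3).

```python
import re

try:
    # Python 3
    from html.entities import name2codepoint
    unichr = chr
except ImportError:
    # Python 2.6/2.7 (unichr is already a builtin there)
    from htmlentitydefs import name2codepoint


def unescape_entities(text):
    """Replace character references and named entities found in text.

    unescape_entities is a hypothetical name used only in this sketch.
    """
    def fixup(match):
        ref = match.group(0)
        try:
            if ref.startswith('&#x') or ref.startswith('&#X'):
                # hexadecimal character reference, e.g. &#x20AC;
                return unichr(int(ref[3:-1], 16))
            if ref.startswith('&#'):
                # decimal character reference, e.g. &#8364;
                return unichr(int(ref[2:-1]))
            # named entity, e.g. &euro;
            return unichr(name2codepoint[ref[1:-1]])
        except (KeyError, ValueError, OverflowError):
            # unknown name or invalid number: leave the text as it was
            return ref

    return re.sub(r'&#?\w+;', fixup, text)


print(unescape_entities('&lt;xml ... &gt;'))  # <xml ... >
```

Anything the table does not recognise is passed through unchanged, so stray ampersand sequences in the input are not corrupted.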

Thanks to Fred Larson for linking to the other question on SO. Thanks to dF for posting the link.

【Comments】: