How to encode Unicode to ASCII with HTML entities

【Question title】: How to encode Unicode to ASCII with HTML entities
【Posted】: 2013-10-17 18:57:36
【Question】:

I need to encode a Unicode (UTF-8) string to ASCII using HTML entities in Python.

To be clear:

source = u"Hello…"
wanted = "Hello&hellip;"

This is not a solution:

as_ascii = source.encode('ascii', 'xmlcharrefreplace')

because as_ascii would be set to Hello&#8230;, i.e. it uses an XML (numeric) character reference rather than an HTML (named) entity.
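
For reference, a minimal sketch of the behaviour described above. It only uses the built-in `xmlcharrefreplace` error handler from the question; the escape `\u2026` stands in for the literal … character so the snippet does not depend on the file encoding.

```python
source = u"Hello\u2026"  # same text as u"Hello…"

# xmlcharrefreplace swaps each non-ASCII character for &#<decimal code point>;
as_ascii = source.encode("ascii", "xmlcharrefreplace")

# Python 2 prints a plain str; Python 3 prints a bytes literal.
print(as_ascii)
```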

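Is there a Python module / function / entity dictionary that can:

1. Encode Unicode to ASCII using HTML character references.
2. Replace XML character references in an ASCII string with HTML character references (where applicable).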
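
Approach 2 could look roughly like the sketch below. It assumes the HTML 4 entity table shipped with Python (`htmlentitydefs.codepoint2name`, renamed `html.entities.codepoint2name` in Python 3); the helper name `numeric_to_named` and the regular expression are illustrative choices, not part of any library.

```python
import re

try:
    from htmlentitydefs import codepoint2name   # Python 2
except ImportError:
    from html.entities import codepoint2name    # Python 3

# Matches decimal (&#8230;) and hexadecimal (&#x2026;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def numeric_to_named(text):
    """Hypothetical helper: swap numeric references for named entities where one exists."""
    def swap(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        name = codepoint2name.get(codepoint)
        # Leave the reference untouched when the table has no name for it.
        return "&%s;" % name if name else match.group(0)
    return _NUMERIC_REF.sub(swap, text)

print(numeric_to_named("Hello&#8230;"))  # Hello&hellip;
```

Chaining this after `source.encode('ascii', 'xmlcharrefreplace')` would cover approach 1 as well, since every character without a named entity simply keeps its numeric reference.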

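【Comments】:

For an entity dictionary, would htmlentitydefs.codepoint2name help with approach 2? htmlentitydefs.codepoint2name[8230] == "hellip".
Yes! Thanks. I can use htmlentitydefs!
I had to pull a few elements out of the htmlentitydefs package, but I came up with this: gist.github.com/jvanasco/7030697
Numeric character references are just as valid in HTML as in XML, and you will probably need them for every character that has no HTML-specific entity.
Yes, I know they are equivalent when rendered. I specifically want HTML entities.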

Tags: python unicode encoding utf-8

【Solution 1】:

Example program (file decode_to_entity.py):

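# -*- coding: utf-8 -*-

try:
    import htmlentitydefs as entity    # Python 2
except ImportError:
    import html.entities as entity     # Python 3 name of the same module

def decode_to_entity(s):
    """Replace every character that has a named HTML entity with &name;."""
    parts = []
    for ch in s:
        # codepoint2name maps a code point (int) to an entity name, e.g. 8230 -> "hellip"
        name = entity.codepoint2name.get(ord(ch))
        if name is not None:
            parts.append("&" + name + ";")
        else:
            # No named entity: keep the character unchanged.
            parts.append(ch)
    return "".join(parts)


print(decode_to_entity(u"Hello…"))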

Example run:

$ python decode_to_entity.py
Hello&hellip;
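
Characters without a named entity (Japanese text, for example) pass through the function above unchanged, so its result is not guaranteed to be ASCII. A possible variation, sketched here under the assumption that numeric references are an acceptable fallback, finishes with `xmlcharrefreplace`; the helper name `encode_with_entities` is made up for illustration.

```python
try:
    from htmlentitydefs import codepoint2name   # Python 2
except ImportError:
    from html.entities import codepoint2name    # Python 3

def encode_with_entities(text):
    """Hypothetical helper: named entities where available, numeric references otherwise."""
    parts = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        parts.append("&%s;" % name if name else ch)
    # Anything still outside ASCII becomes a numeric reference such as &#26085;
    return "".join(parts).encode("ascii", "xmlcharrefreplace")

# \u2026 is the ellipsis, \u65e5 is a Japanese character with no named entity.
print(encode_with_entities(u"Hello\u2026 \u65e5"))
```

Note that the entity table also contains markup characters such as & and <, so these come out as &amp; and &lt;. That is usually what you want for plain text, but not for a string that already contains HTML tags.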

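【Discussion】:

Wow. This works for my unicode strings, but I don't understand why it works. I have Japanese text, and 'xmlcharrefreplace' also works for web display, but I'm worried that input coming from the web won't be stored as proper UTF-8. How can I reverse this process to store web output text as UTF-8 in the database?
Because ord returns the integer value of the character (which can be greater than 255). Look at this: a = u'Ś' is u'\u015a', and hexadecimal 15a is 346 in decimal (human-readable) form, so ord(a) returns 346. You can read more here: docs.python.org/2/howto/…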
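
The lookup described in the last comment, plus one possible way to reverse the process, can be sketched as follows. The reversal assumes the standard-library unescape function (`html.unescape` in Python 3.4+, `HTMLParser().unescape` in Python 2) is acceptable for the input; the variable names are illustrative.

```python
try:
    from html import unescape                    # Python 3.4+
    from html.entities import codepoint2name
except ImportError:                              # Python 2 fallback
    from HTMLParser import HTMLParser
    from htmlentitydefs import codepoint2name
    unescape = HTMLParser().unescape

a = u"\u015a"                                # the character Ś
print(ord(a))                                # 346
print(hex(ord(a)))                           # 0x15a
print(codepoint2name.get(ord(a)))            # None: no named entity for Ś
print(codepoint2name.get(ord(u"\u2026")))    # hellip

# Reverse direction: turn named and numeric references back into characters,
# then encode to UTF-8 bytes for storage.
restored = unescape(u"Hello&hellip; &#26085;")
utf8_bytes = restored.encode("utf-8")
print(utf8_bytes)
```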