Python returns length of 2 for single Unicode character string

【Question title】: Python returns length of 2 for single Unicode character string
【Posted】: 2015-05-20 12:32:02
【Question】:

In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍
In [4]: unicode_str = utf8_str.decode('utf-8')
In [5]: print(unicode_str)
👍
In [6]: unicode_str
Out[6]: u'\U0001f44d'
In [7]: len(unicode_str)
Out[7]: 2

Since unicode_str contains only a single Unicode code point (0x0001f44d), why does len(unicode_str) return 2 instead of 1?

【Question comments】:

Tags: python python-2.7 unicode python-unicode

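【Answer 1】:

Your Python binary was compiled with UCS-2 support (a narrow build), and internally it represents anything outside the BMP (Basic Multilingual Plane) as a surrogate pair.

This means that such a code point shows up as 2 characters when you ask for the length.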
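To make the surrogate pair concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above U+FFFF into two 16-bit units. The function name to_surrogate_pair is made up for this illustration and is not a standard-library API.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value left after removing the BMP
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print("%#x %#x" % (high, low))  # 0xd83d 0xdc4d
```

These are the two units a narrow build stores for u'\U0001f44d', which is why len() reports 2 there.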

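If this matters, you have to recompile your Python binary to use UCS-4 instead (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer, where Python's Unicode support was overhauled to use a variable-width Unicode type that switches between ASCII, UCS-2 and UCS-4 as required by the code points contained.

On Python versions 2.7 and 3.0 - 3.2, you can detect what kind of build you have by inspecting the sys.maxunicode value; it is 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build and 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up it is always set to 1114111.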
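As a small sketch of that check, the snippet below reports the build type from sys.maxunicode. The helper name unicode_build is an assumption chosen for this example; only sys.maxunicode itself comes from the standard library.

```python
import sys


def unicode_build():
    """Return 'narrow' for a UCS-2 build and 'wide' otherwise."""
    # 0xFFFF means only BMP code points fit in a single unit (narrow build).
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print("%s build, sys.maxunicode = %#x" % (unicode_build(), sys.maxunicode))
```

On Python 3.3 and later this always reports a wide build, since sys.maxunicode is fixed at 0x10FFFF there.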

Demo:

# Narrow build
$ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
65535 2 [u'\ud83d', u'\udc4d']
# Wide build
$ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
1114111 1 [u'\U0001f44d']

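【Comments】:

You can use sys.maxunicode on Python 3 too. It is implied, but worth stating explicitly, that len(u'\U0001f44d') == 1 on Python 3.3+ (or a wide Python 2 build).
@J.F.Sebastian: Sure, but from 3.3 onwards it is a constant, because Python 3.3 and up switch transparently between ASCII, UCS-2 and UCS-4 storage for strings as needed. And you really don't want to use Python
Python 3.3+ has no narrow/wide distinction (the internal representation is not exposed; you don't care what Python uses internally). You can use sys.maxunicode regardless of version.
I never said there was such a distinction.
Yes, which is why narrow_mode = (sys.maxunicode < 0x10ffff) works on any version (Python 2 and 3).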
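Building on the narrow_mode check from the last comment, a rough sketch of counting code points rather than UTF-16 units could look like the following. The function code_point_len and its narrow parameter are hypothetical names introduced for this example; the approach simply skips a low surrogate that directly follows a high surrogate.

```python
import sys

# Same test as in the comment above: true only on narrow (UCS-2) builds.
NARROW_BUILD = sys.maxunicode < 0x10FFFF


def code_point_len(text, narrow=NARROW_BUILD):
    """Count code points, treating a surrogate pair as a single character."""
    if not narrow:
        return len(text)  # wide builds and Python 3.3+ already count code points
    count = 0
    prev_high = False
    for unit in text:
        value = ord(unit)
        if prev_high and 0xDC00 <= value <= 0xDFFF:
            prev_high = False  # low half of a pair that was already counted
            continue
        count += 1
        prev_high = 0xD800 <= value <= 0xDBFF
    return count


print(code_point_len(u"\U0001f44d"))  # 1
```

The narrow argument exists only so the surrogate-handling branch can be exercised on a wide build; in normal use the default is enough.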