What is a unicode string? [closed]

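【Question title】: What is a unicode string? [closed]
【Posted】: 2014-03-15 13:02:39
【Question】:

What exactly is a unicode string?

What is the difference between a regular string and a unicode string?

What is utf-8?

I am trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?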

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                      ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string

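【Comments】:

An internet search might be a good place to start...
Possible duplicate of Unicode in Python
See also bit.ly/unipain

Tags: python unicode utf-8

【Solution 1】:

Update: Python 3

In Python 3, Unicode strings are the default. The str type is a collection of Unicode code points, and the bytes type is used to represent collections of 8-bit integers (often interpreted as ASCII characters).

Here is the code from the question, updated for Python 3: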

>>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True

Working with files:

>>> f = open('foo.txt', 'r') # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
>>>     # here line is a str object

>>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
>>> for line in f:
>>>     # here line is a bytes object

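Historical answer: Python 2

In Python 2, the str type is a collection of 8-bit characters (like Python 3's bytes type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.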

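Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.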
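The relationship between a symbol, its code point, and its encodings can be sketched in Python 3 as follows. This is only an illustration: the choice of Ω and of the three codecs is an assumption made for the example, not something taken from the question.

```python
symbol = '\u03a9'  # GREEK CAPITAL LETTER OMEGA, chosen only for illustration

code_point = ord(symbol)          # the number Unicode assigns to the symbol
assert chr(code_point) == symbol  # chr() maps the number back to the character

# The same code point becomes different bytes under different encodings.
as_utf8 = symbol.encode('utf-8')
as_utf16 = symbol.encode('utf-16-be')
as_utf32 = symbol.encode('utf-32-be')

print(hex(code_point), as_utf8, as_utf16, as_utf32)
```

The code point itself never changes; only the byte sequence chosen to store it does.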

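UTF-8 is one such encoding. Code points up to U+007F (the ASCII range) are encoded as a single byte; higher code points are encoded as sequences of two to four bytes.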
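To see how the UTF-8 length grows with the code point, here is a small Python 3 sketch. The four sample characters and the dictionary names are made up for this example.

```python
# Sample characters picked for illustration, one per UTF-8 length.
samples = {
    'A': '\u0041',          # U+0041, ASCII range
    'n-tilde': '\u00f1',    # U+00F1
    'euro': '\u20ac',       # U+20AC
    'emoji': '\U0001f600',  # U+1F600, outside the Basic Multilingual Plane
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode('utf-8')) for name, ch in samples.items()}
print(utf8_lengths)
```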

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.
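Python 3's repr() shows printable non-ASCII characters as glyphs, so the escaped display described above no longer appears by default. As a rough sketch, the built-in ascii() can be used to get a similar escaped form:

```python
my_str = 'A unicode \u018e string \xf1'

shown = repr(my_str)     # Python 3 keeps the printable glyphs as they are
escaped = ascii(my_str)  # like repr(), but non-ASCII characters are escaped

print(escaped)
```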

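The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.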
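The 20 versus 22 count can be checked with a short Python 3 sketch (in Python 3 the encoded result is a bytes object rather than a str). The variable names are chosen for this example.

```python
my_str = 'A unicode \u018e string \xf1'
my_bytes = my_str.encode('utf-8')

char_count = len(my_str)    # counts code points
byte_count = len(my_bytes)  # counts bytes after encoding

# Characters whose UTF-8 form needs more than one byte.
multi_byte = [ch for ch in my_str if len(ch.encode('utf-8')) > 1]

print(char_count, byte_count, len(multi_byte))
```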

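When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters as before, because s is of type str, not unicode.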

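The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing the byte sequences. The result is a Unicode string.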
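In Python 3 the same step is written as bytes.decode(). The sketch below also shows why the codec has to match the one used for encoding. The latin-1 comparison and the lone \xf1 byte are illustrative choices, not part of the original question.

```python
raw = b'A unicode \xc6\x8e string \xc3\xb1'

right = raw.decode('utf-8')    # the codec that produced the bytes
wrong = raw.decode('latin-1')  # a different codec: no error, but wrong text

# Bytes that are not valid UTF-8 raise an error instead of being guessed at.
try:
    b'\xf1'.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(len(right), len(wrong), decode_failed)
```

Decoding with latin-1 maps every byte to one character, so the two-byte sequences come back as two characters each and the text is silently wrong.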

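The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.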
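A minimal Python 3 sketch of the same idea uses the built-in open() with an explicit encoding. The scratch file path and the sample text are assumptions made for this example, and newline='' is passed only so the bytes on disk match the string exactly on every platform.

```python
import os
import tempfile

sample = 'A unicode \u018e string \xf1\n'  # illustrative text

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

# Text mode with an explicit encoding: str goes in, UTF-8 bytes land on disk.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(sample)

# Text mode: each line comes back as a str (decoded code points).
with open(path, 'r', encoding='utf-8', newline='') as f:
    text_lines = [line for line in f]

# Binary mode: each line comes back as a bytes object (raw, undecoded).
with open(path, 'rb') as f:
    byte_lines = [line for line in f]
```

Passing the encoding explicitly avoids depending on the platform default, which may not be UTF-8 everywhere.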

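【Discussion】:

More specifically, the above applies to Python v2. In Python v3, Unicode strings are the default.
Thanks, ...but when do we actually get to "see" those unicode characters? Do we "inject" our Python code into a system that is able to display them?
Usually nowadays, if you just print the string to console output, or write it to a file and then view it in an editor, you will be able to see any non-ASCII characters. Since UTF-8 is mostly backwards compatible with ASCII, most systems should now default to UTF-8 encoding. (For the same reason, you should be able to save Unicode characters directly into a .py file and skip the escaped representation.) @aderchox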

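【Solution 2】:

Python supports the string type and the unicode type. A string is a sequence of chars, while a unicode is a sequence of "pointers". The unicode is an in-memory representation of the sequence, and every symbol in it is not a char but a number (in hex format) intended to select a char in a map. So a unicode var does not have an encoding, because it does not contain chars.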

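【Discussion】:

You can read about this in detail on this blog: carlosble.com/2010/12/understanding-python-and-unicode
-1 Not an accurate answer. These are not "pointers", and both types are used to represent strings.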