Using non-ascii characters with the default ascii encoding? Answers

【Question title】: Using non-ascii characters with the default ascii encoding?
【Posted】: 2014-04-22 17:59:55
【Question】:

I am using Python 2.7. This page says:

Python's default encoding is the 'ascii' encoding

And indeed I get the following:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

But when I open my interpreter and type:

>>> 'É'
'\xc3\x89'

That looks like UTF-8:

>>> u'É'.encode( 'utf8' )
'\xc3\x89'

What is going on? Shouldn't the default ascii encoding raise a UnicodeEncodeError? Is a UTF-8 encoding being triggered somewhere?

【Comments】:

Tags: python-2.7 utf-8 character-encoding ascii python-unicode

【Solution 1】:

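Your terminal is configured to use UTF-8. It sends UTF-8 data to Python, and Python stores that data in a byte string.

When you print that byte string, the terminal interprets those bytes as UTF-8 again.

At no point does Python actually interpret these bytes as anything other than raw bytes; no decoding or encoding takes place at the Python level.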
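The pass-through described above can be sketched as follows. This is only an illustration, written so that it runs on both Python 2.7 and Python 3; the variable names are made up for the example, and the byte values are the ones shown in the question.

```python
# The two bytes a UTF-8 terminal sends for LATIN CAPITAL LETTER E WITH ACUTE (U+00C9).
raw = b'\xc3\x89'

# Python holds them as two opaque bytes; nothing has been decoded yet.
byte_count = len(raw)

# Only an explicit decode turns the bytes into a single Unicode character.
text = raw.decode('utf-8')
char_count = len(text)

# Encoding that character again yields the original two bytes.
round_trip = text.encode('utf-8')

print(byte_count, char_count, round_trip == raw)
```

The byte string has length 2 and the decoded text has length 1, which is why the interpreter echoes two escape sequences for what looks like one character.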

If you try to decode the bytes implicitly, an exception is raised:

>>> unicode('\xc3\x89')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Here Python used sys.getdefaultencoding() and the decoding failed.
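The same failure can be reproduced without depending on the default encoding by naming the codec explicitly. A minimal sketch, assuming the bytes from the question; `try_decode` is a hypothetical helper introduced only for this example, not something Python provides.

```python
raw = b'\xc3\x89'  # UTF-8 bytes for U+00C9


def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# ASCII only covers byte values below 128, so 0xc3 is rejected.
ascii_result = try_decode(raw, 'ascii')

# UTF-8 accepts the two bytes and produces one character.
utf8_result = try_decode(raw, 'utf-8')
```

In Python 2.7, `unicode(raw)` with no codec argument behaves like the `'ascii'` case here, because it falls back to sys.getdefaultencoding().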

For input coming from stdin in the interactive prompt to create Unicode literals (using u'...'), Python does not use sys.getdefaultencoding() but the sys.stdin.encoding value instead:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'

Python takes this value either from the PYTHONIOENCODING environment variable (if set) or from locale.getpreferredencoding():

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
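One way to observe the effect of PYTHONIOENCODING is to start a child interpreter with the variable set and ask it for sys.stdin.encoding. This is a rough sketch rather than a definitive test: `child_stdin_encoding` is a hypothetical helper, the choice of `latin-1` is arbitrary, and the child's stdin is a pipe here, not a terminal.

```python
import codecs
import os
import subprocess
import sys


def child_stdin_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return its sys.stdin.encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.Popen(
        [sys.executable, '-c', 'import sys; print(sys.stdin.encoding)'],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, env=env)
    out, _ = proc.communicate()
    return out.decode('ascii').strip()


reported = child_stdin_encoding('latin-1')

# Compare normalised codec names, since the reported spelling may vary.
same_codec = codecs.lookup(reported).name == codecs.lookup('latin-1').name
```

Comparing through codecs.lookup() avoids guessing whether the interpreter reports the name as given or in a canonical form.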

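When reading Python source files, Python 2 uses ASCII to interpret such literals, while Python 3 uses UTF-8. Both can be told which codec to use with the PEP 263 source encoding comment, which must be on the first or second line of the input file: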

# coding: UTF-8
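To see which encoding a declaration selects, Python 3's standard library exposes tokenize.detect_encoding(), which reads the PEP 263 comment from the first two lines of a source file. The sketch below is Python 3 only (Python 2.7 has no such function), and `declared_encoding` is a hypothetical wrapper written for this example.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3's tokenizer detects for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A PEP 263 comment on the first line selects the codec.
with_cookie = declared_encoding(b"# coding: latin-1\nname = 'x'\n")

# Without a declaration, Python 3 falls back to UTF-8.
without_cookie = declared_encoding(b"name = 'x'\n")
```

The function reports a normalised codec name, so a `latin-1` declaration comes back as `iso-8859-1`.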

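【Comments】:

Note: in Python 3 there is os.device_encoding(fd). It is also worth mentioning that changing the source code encoding has no effect on sys.getdefaultencoding(). The REPL uses locale.getpreferredencoding(), which may differ from sys.stdin.encoding (that is why non-ascii literals typed in it may work by default without an encoding declaration).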