Python ISO-8859-1 encoding: answers

【Question title】: Python ISO-8859-1 encoding
【Posted】: 2011-12-27 23:18:57
【Question】:

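I am running into a serious encoding problem in Python when working with the ISO-8859-1 / Latin-1 character set.

When I use os.listdir to get the contents of a folder, I get strings encoded in ISO-8859-1 (for example 'Ol\xe1 Mundo'), but in the Python interpreter the same string is encoded in a different character set: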

In : 'Olá Mundo'.decode('latin-1')
Out: u'Ol\xa0 Mundo'

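How can I force Python to decode the string to the same format? I can see that os.listdir returns correctly encoded strings, but the interpreter does not (the character 'á' corresponds to '\xe1' in ISO-8859-1, not '\xa0'):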

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

Any ideas on how to overcome this?

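【Comments】:

A bit pedantic: os.listdir() returns bytes. The process that named the file chose a name that happens to have a specific interpretation in ISO-8859-1. The file name could just as easily have been stored in BIG-5 or JIS, and os.listdir() would not care.
@sarnold: os.listdir() can return either strings or bytes: it depends on what you pass in and on the contents of the directory.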
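
To illustrate the last comment, here is a minimal sketch of how the argument type drives the return type. It is written for Python 3, whereas the thread itself is about Python 2 (where passing a unicode path yields unicode names and passing a byte string yields byte strings). The helper name list_both_ways and the file name ola_mundo.txt are made up for the example.

```python
import os
import tempfile


def list_both_ways(directory):
    """Return (names_as_str, names_as_bytes) for the same directory."""
    # A str path gives back str names.
    as_text = sorted(os.listdir(directory))
    # A bytes path gives back the names as bytes.
    as_bytes = sorted(os.listdir(os.fsencode(directory)))
    return as_text, as_bytes


with tempfile.TemporaryDirectory() as tmp:
    # ASCII-only file name so the sketch behaves the same on any filesystem.
    open(os.path.join(tmp, "ola_mundo.txt"), "w").close()
    text_names, byte_names = list_both_ways(tmp)

print(text_names)  # list of str
print(byte_names)  # list of bytes
```

When names come back as bytes, it is up to the caller to know (or guess) which encoding the program that created the files used.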

Tags: python unicode encoding iso-8859-1

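【Answer 1】:

When you enter a non-unicode string literal in a Python 2 interactive session, the system default encoding is assumed for it (in the sessions shown here, the console encoding reported by sys.stdin.encoding).

It looks like you are on Windows, so the default encoding is probably "cp850" or "cp437":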

C:\>python
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'cp850'
>>> 'Olá Mundo'
'Ol\xa0 Mundo'
>>> u'Olá Mundo'.encode('cp850')
'Ol\xa0 Mundo'
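
The same effect can be reproduced without a Windows console. The following is a rough Python 3 sketch (the thread uses Python 2) that encodes the same text under several code pages and then repeats the mistake from the question, decoding cp850 bytes as Latin-1. The variable names are illustrative only.

```python
text = "Olá Mundo"

# The same character maps to different byte values in different code pages.
for codec in ("cp850", "cp437", "latin-1", "cp1252"):
    print(codec, text.encode(codec))

# What the question effectively did: bytes typed at a cp850 console,
# then decoded as if they were Latin-1.
console_bytes = text.encode("cp850")
misread = console_bytes.decode("latin-1")
print(repr(misread))
```

Byte 0xA0 is 'á' in cp850 and cp437 but a no-break space in Latin-1, which is why the decoded value in the question contains \xa0 instead of \xe1.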

If you change the code page to 1252 (roughly equivalent to latin1), the string is displayed as expected:

C:\>chcp 1252
Active code page: 1252

C:\>python
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'cp1252'
>>> 'Olá Mundo'
'Ol\xe1 Mundo'
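
If changing the code page is not an option, one possible alternative is to decode each byte string with the encoding it actually came from and compare the resulting unicode text. This is only a sketch in Python 3 syntax, with the two encodings hard-coded as assumptions; in a Python 2 session like the ones in this answer, sys.stdin.encoding would take the place of the literal "cp850".

```python
# Bytes as they would arrive from a cp850 console (assumed encoding).
from_console = b"Ol\xa0 Mundo"
# Bytes as os.listdir returned them in the question (ISO-8859-1 file name).
from_listdir = b"Ol\xe1 Mundo"

# Decode each value with its own source encoding, not a shared one.
console_text = from_console.decode("cp850")
listdir_text = from_listdir.decode("latin-1")

# The raw bytes differ, but the decoded text is identical.
print(from_console == from_listdir)
print(console_text == listdir_text)
```

The design point is that the comparison happens on decoded text, so it no longer matters which code page the console happens to use.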

【Comments】: