【Posted】: 2014-09-23 15:16:27
【Question】:
I am trying to write a CSV file that contains Unicode characters, so I am using the unicodecsv package. Unfortunately, I still get a UnicodeDecodeError:
# -*- coding: utf-8 -*-
import codecs
import unicodecsv
raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
encoded_contents = unicode(raw_contents, errors='replace')
with codecs.open('test.csv', 'w', 'UTF-8') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])
Here is the traceback:
Traceback (most recent call last):
  File "unicode_test.py", line 11, in <module>
    w.writerow(["1", encoded_contents])
  File "/Library/Python/2.7/site-packages/unicodecsv/__init__.py", line 83, in writerow
    self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 17: ordinal not in range(128)
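I assumed that converting the text to Unicode would be enough, but it is not. I would really like to understand what is going on, so that I am better prepared to handle these errors in other projects.
Based on the traceback, I can reproduce the error like this: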
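One plausible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: unicode(raw_contents, errors='replace') decodes with the default ASCII codec and turns every non-ASCII byte into U+FFFD; unicodecsv then encodes the row to UTF-8 bytes; and the codecs.open stream, which expects unicode text, implicitly decodes those bytes as ASCII before re-encoding them. The snippet below imitates those three steps by hand. The names replaced, row_bytes and error are illustrative only, and the manual ','.join is an assumption standing in for what the csv writer emits for this simple row.

```python
# -*- coding: utf-8 -*-
# The byte string a Python 2 source file would hold for the literal in the question.
raw_contents = u'He observes an \u201cOversized Gorilla\u201d near Ashford'.encode('utf-8')

# Step 1: ASCII decode with errors='replace'; every non-ASCII byte becomes U+FFFD.
replaced = raw_contents.decode('ascii', 'replace')

# Step 2 (assumed stand-in for the csv writer): build the row and encode it to UTF-8 bytes.
row_bytes = u','.join([u'1', replaced]).encode('utf-8')

# Step 3: a text stream that receives bytes in Python 2 decodes them as ASCII first.
error = None
try:
    row_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

Under these assumptions the failing byte is 0xef at position 17, the same byte and offset as in the traceback, which suggests the row is being encoded twice: once by unicodecsv and once by the codecs stream. Handing the writer a file opened in binary mode instead of through codecs.open, and decoding the input explicitly as UTF-8, would avoid both problems if that reading is right.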
>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>>
Until now I thought I had a decent working knowledge of handling Unicode text in Python 2.x, but this has humbled me.
【Comments】:
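The interactive session above raises a decode error from a call to encode because, in Python 2, calling .encode() on a byte string first decodes it implicitly with the default ASCII codec. A minimal sketch that spells out that hidden step explicitly, so it behaves the same on Python 2 and 3 (the name failure is illustrative only):

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes equivalent to the str literal typed at the Python 2 prompt.
raw_contents = u'He observes an \u201cOversized Gorilla\u201d near Ashford'.encode('utf-8')

failure = None
try:
    # Python 2 performs this ASCII decode silently inside str.encode('UTF-8').
    raw_contents.decode('ascii').encode('utf-8')
except UnicodeDecodeError as exc:
    # Record the codec, the offset, and the offending byte.
    failure = (exc.encoding, exc.start, exc.object[exc.start:exc.start + 1])

print(failure)
```

The recorded offset is 15 and the byte is 0xe2, the first byte of the opening curly quote, which matches the message in the session.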
-
encoded_contents is a misleading name. unicode_text.encode(char_encoding) == bytes_data, and conversely bytes_data.decode(char_encoding) == unicode_text. The name encoded_contents implies (wrongly) that it is a bytes object rather than a unicode one.
Tags: python unicode python-unicode
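The naming convention in this comment can be illustrated with a short round trip. This is only a sketch: the variable names follow the comment, and UTF-8 is assumed as the character encoding.

```python
# -*- coding: utf-8 -*-
# Text: a sequence of code points (unicode in Python 2, str in Python 3).
unicode_text = u'He observes an \u201cOversized Gorilla\u201d near Ashford'

# encode: text -> bytes
bytes_data = unicode_text.encode('utf-8')

# decode: bytes -> text
decoded_text = bytes_data.decode('utf-8')

# The round trip is lossless, but the two objects have different lengths,
# because each curly quote occupies three bytes in UTF-8.
print(decoded_text == unicode_text)
print(len(unicode_text), len(bytes_data))
```

Naming variables after what they hold (text versus bytes) makes it easier to spot a call that goes in the wrong direction.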