Why is the same utf-8 string fine in print and failing in logging? Answers

【Question title】: Why is the same utf-8 string fine in print and failing in logging?
【Posted】: 2013-02-20 20:40:00
【Question】:

To log utf-8 strings, do I need to do by hand whatever print is doing for me?

for line in unicodecsv.reader(cfile, encoding="utf-8"):
    for i in line:
        print "process_clusters: From CSV: %s" % i
        print "repr: %s" % repr(i)
        log.debug("process_clusters: From CSV: %s", i)

My print statements work whether the string is Latin or Russian Cyrillic.

process_clusters: From CSV: escuchan
repr: u'escuchan'
process_clusters: From CSV: говоритъ
repr: u'\u0433\u043e\u0432\u043e\u0440\u0438\u0442\u044a'

However, log.debug won't let me pass in the same variable. I get this error:

Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/logging/__init__.py", line 765, in emit
    self.stream.write(fs % msg.encode("UTF-8"))
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/codecs.py", line 686, in write
    return self.writer.write(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 28: ordinal not in range(128)

My logger, formatter, and handler are:

log = logging.getLogger(__name__)
loglvl = getattr(logging, loglevel.upper()) # convert text log level to numeric
log.setLevel(loglvl) # set log level
handler = logging.FileHandler('inflection_finder.log', 'w', 'utf-8')
handler.setFormatter(logging.Formatter('[%(levelname)s] %(message)s'))
log.addHandler(handler)

I am using Python 2.6.7.

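【Comments】:

You should also show the creation of the log object and any handlers added to it. Most likely, that is where the problem lies.
I think this problem may be specific to Python 2.6.x. I did not manage to reproduce it in Python 2.7.
@unutbu: I can't reproduce it on 2.6 either.
Added the handler information and the repr of the string for better visibility.
@unutbu, I wish you were right. I installed 2.7 and can still reproduce the problem consistently.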

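Tags: python unicode utf-8

【Solution 1】:

Reading the traceback, it looks like the logging module is trying to encode the message before writing it. The message is assumed to be an ASCII string, which it cannot be because it contains UTF-8 characters. If you convert the message to Unicode before passing it to the logger, it might work.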

    log.debug(u"process_clusters: From CSV: %s", i)

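Edit: I noticed that your parameter string is already decoded to Unicode, so I updated the example accordingly.

Also, based on your latest edit, you may want to use a Unicode string in your setup: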

handler.setFormatter(logging.Formatter(u'[%(levelname)s] %(message)s'))
                                     --^--
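
To illustrate the distinction the traceback hinges on, here is a minimal sketch that is not part of the original answer. It takes the word from the question's repr, encodes it to UTF-8, and then makes explicit the ASCII decode that Python 2 performs implicitly when an already-encoded byte string is handed to an encoder. The variable names are mine, and the snippet is written to run on both Python 2 and Python 3.

```python
# The word from the question, as text (a sequence of code points).
text = u"\u0433\u043e\u0432\u043e\u0440\u0438\u0442\u044a"

# The same word as UTF-8 bytes: each Cyrillic letter takes two bytes.
data = text.encode("utf-8")
first_byte = data[:1]  # the 0xd0 byte named in the traceback

# Treating those bytes as ASCII fails on the very first byte.
# Python 2 attempts this decode implicitly when bytes are encoded again.
try:
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(repr(first_byte))
print(error)
```

The failure is a UnicodeDecodeError even though the code path is "encoding", which matches the error type in the question's traceback.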

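【Discussion】:

I found this suggestion elsewhere too, but I get exactly the same traceback with this change. :-/
If the repr of the variable is already u'\u0433\u043e\u0432\u043e\u0440\u0438\u0442\u044a', does that mean it is already Unicode, already utf-8 encoded, or am I being too naive?
Rats. I was really hoping the Formatter edit would fix it, but I still get the traceback with both suggested changes. Are you on 2.6?
@kiminoa, I'm on 2.7, but I have to admit I haven't used the logging module. I only know that if you use Unicode consistently, encoding to UTF-8 is guaranteed to work without errors. Your repr says the string is Unicode, not UTF-8; there is a difference.
Do you prefer a different logging module? I'm super new to Python and eager to learn what works better for everyone. I'm replacing my homebrewed if debug == 1: print "faux log" with a proper module, on the theory that the logging module would work better, but I'm starting to make small growly noises at hour 5 of this rabbit hole. Also, forgive the newbie meta question: I can't figure out how to @ your name. Other people's names work, but @ with mark-ransom just comes up blank. Is that a setting specific to you?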

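【Solution 2】:

Strings versus unicode in Python 2 are a mess... fortunately this was fixed in Python 3. But assuming that migrating to Python 3 is not an option, here goes.

As I see it, there are two options for handling encoding when using logging:

Open the file as binary and use byte strings for all strings.
Open the file as text and use unicode strings for all strings.

Any other choice is doomed to fail!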

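The problem is that you specified the "utf-8" encoding when opening the file, but your Russian text is a unicode string.

So you can do one (and only one) of the following:

Open the file as binary, i.e. remove the "utf-8" argument from the FileHandler construction.
Convert all the relevant text to unicode strings, which includes the arguments to log.debug and the argument to logging.Formatter.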
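
The second option (text everywhere) could look roughly like the sketch below. It reuses the handler setup from the question, but the logger name and the temporary log directory are assumptions added for illustration. It is written to run on Python 3, where the u prefix is accepted but redundant; on Python 2 those prefixes are what make the format strings unicode. It has not been checked against the asker's 2.6.7 setup.

```python
import io
import logging
import os
import tempfile

# Hypothetical location for the log file, so the sketch leaves no stray files.
log_path = os.path.join(tempfile.mkdtemp(), "inflection_finder.log")

log = logging.getLogger("unicode_logging_sketch")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False  # keep the record out of any root handlers

# Text-mode file with an explicit encoding, as in the question.
handler = logging.FileHandler(log_path, "w", "utf-8")
# Unicode format string for the formatter ...
handler.setFormatter(logging.Formatter(u"[%(levelname)s] %(message)s"))
log.addHandler(handler)

# ... and unicode for both the message template and its argument.
word = u"\u0433\u043e\u0432\u043e\u0440\u0438\u0442\u044a"
log.debug(u"process_clusters: From CSV: %s", word)

# Flush and release the file before reading it back.
handler.close()
log.removeHandler(handler)

with io.open(log_path, encoding="utf-8") as f:
    logged = f.read()
```

Reading the file back with the same encoding is a quick way to confirm that the Cyrillic text survived the round trip.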

【Discussion】: