Why does printing to a utf-8 file fail?

【Question title】: Why does printing to a utf-8 file fail?
【Posted】: 2011-09-25 12:37:30
【Question】:

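I ran into a problem this afternoon. I was able to solve it, but I don't really understand why the fix works.

It is related to a problem I had the week before: python check if utf-8 string is uppercase

Basically, the following does not work: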

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

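But if I open the new file with outFile = open('test.xml', 'w') instead of codecs.open('test.xml', 'w', 'utf-8'), it runs fine.

So what is happening?

- Since encoding='utf-8' is specified in etree.tostring(), is the file being encoded a second time?
- If I keep codecs.open() and remove encoding='utf-8', the file ends up as an ascii file. Why? Is it because etree.tostring() defaults to ascii encoding?
- But isn't etree.tostring() simply being written to standard output, which is then redirected to a file that was created as a utf-8 file?
- Is print>> not working the way I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is going on here? It may be trivial, but I am obviously a bit confused and would like to understand why my solution works.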

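【Question discussion】:

It looks like tostring() produces a str, while writing to a file opened with codecs.open requires unicode. Try opening the file with plain open() and keep the encoding='utf-8' argument in the call to tostring(). Also, word.encode('utf8').decode('utf8')!?
@Thomas K, 1. I can use open(), I was just curious why. 2. word.encode('utf8').decode('utf8') must have come from a misunderstanding on my part. I am sure I need word.decode('utf8') in my project, where the words are fed in from a different file. For more on .isupper() and utf8, see this => stackoverflow.com/questions/6391442/…
Your call to tostring() produces an encoded string (not unicode). A file opened with codecs.open() expects to receive unicode, so when you hand it a byte string it chokes (in fact, it tries to decode it as ascii and re-encode it as utf-8). 2: I understand that you may need to decode, but you are encoding and then immediately decoding with the same codec, so you get back exactly what you started with. I highly recommend reading: joelonsoftware.com/articles/Unicode.html
@matchew: you may want to look at that answer.

Tags: python unicode encoding utf-8 lxml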

【Solution 1】:

In addition to MRAB's answer, a few lines of code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

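# do some other xml building here

# codecs.open() returns a file object that expects unicode, not encoded bytes
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    # encoding=unicode makes tostring() return a unicode object (Python 2)
    f.write(etree.tostring(root, encoding=unicode))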
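
For readers on Python 3, the same idea (serialize to text, let the file object do the encoding) can be sketched with the standard library alone. This is only a sketch under assumptions: it uses xml.etree.ElementTree instead of lxml, passes the string 'unicode' as the encoding, and writes to a temporary path (out_path) made up for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('root')
sect = ET.SubElement(root, 'sect')
title = ET.SubElement(sect, 'title')
title.text = '\u041c\u041e\u0421\u041a\u0412\u0410'  # Cyrillic, all uppercase

# encoding='unicode' asks for text (str) rather than encoded bytes
xml_text = ET.tostring(root, encoding='unicode')

# out_path is a hypothetical location chosen for this example
out_path = os.path.join(tempfile.mkdtemp(), 'test.xml')

# The file object encodes the text to UTF-8 exactly once, on write
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(xml_text)

# Read the raw bytes back to confirm what ended up on disk
with open(out_path, 'rb') as f:
    raw_bytes = f.read()
```

The point is that only one layer encodes: either the serializer returns bytes and the file is opened in binary mode, or the serializer returns text and the file does the encoding.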

【Discussion】:

【Solution 2】:

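You opened the file with the UTF-8 encoding, which means it expects Unicode strings.

tostring encodes to UTF-8 (as a byte string, str), and that is what you write to the file.

Because the file expects Unicode, it first decodes the byte string to Unicode using the default ASCII encoding, so that it can then encode the Unicode to UTF-8.

Unfortunately, the byte string is not ASCII.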
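
The chain of events above can be reproduced by hand. The sketch below is written for Python 3 (an assumption; the question uses Python 2.7), where the ASCII decode that Python 2 performs implicitly has to be spelled out explicitly. The names payload and caught are illustrative only.

```python
# UTF-8 bytes of the kind tostring(..., encoding='utf-8') hands to the file
payload = '\u041c\u041e\u0421\u041a\u0412\u0410'.encode('utf-8')

caught = None
try:
    # Explicit stand-in for the implicit ASCII decode Python 2 performs
    # before the codecs writer can re-encode the data as UTF-8
    payload.decode('ascii')
except UnicodeDecodeError as err:
    caught = err

# The first Cyrillic letter begins with byte 0xd0, the same byte the
# traceback in the question complains about
print(hex(payload[0]), caught.reason)
```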

EDIT: The best advice for avoiding this kind of problem is to use Unicode internally, decoding on input and encoding on output.
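
A minimal sketch of that rule, assuming Python 3 and UTF-8 input; the helper name is_upper_title and the sample byte strings are made up for illustration. Bytes are decoded once at the boundary, the case test runs on text, and encoding happens only when writing out.

```python
def is_upper_title(raw):
    """Decode UTF-8 bytes at the input boundary, then test case on text."""
    text = raw.decode('utf-8')  # input: bytes -> text
    return text.isupper()       # case rules are applied to characters, not bytes


# Hypothetical inputs, UTF-8 encoded as they might arrive from another file
samples = [
    b'\xd0\x9c\xd0\x9e\xd0\xa1\xd0\x9a\xd0\x92\xd0\x90',  # Cyrillic, all uppercase
    b'R\xc3\x89SUM\xc3\x89',                              # uppercase with accents
    b'R\xc3\xa9sum\xc3\xa9',                              # mixed case
    b'R\xc3\xa9SUM\xc3\xa9',                              # uppercase except the accents
]
results = [is_upper_title(raw) for raw in samples]

# Output boundary: text -> bytes, done once, right before writing
round_trip = samples[0].decode('utf-8').encode('utf-8')
```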

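【Discussion】:

Now that I have stepped away from the problem, your answer seems to make the most sense. It's a pity Thomas K only commented rather than posting an answer, since he said essentially the same thing.

【Solution 3】: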

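Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (although it does not support pretty_print). Wrap the root element in an ElementTree and use its write method.

Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes: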

#!/usr/bin/python
# coding: utf8

import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')


words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')

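【Discussion】:

I'm rather fond of print>>outFile, but I agree it isn't very pythonic. Overall, this post didn't tell me anything I didn't already know. Thanks for giving it a try.
@Mark Tolenen: He copied the data with the "escape codes" from an answer I wrote to one of his earlier questions. The point of using "escape codes" (the output of repr()) is to reduce the ambiguity floating around... one needs very good eyes to tell u'MOCKBA' (Latin) from u'МОСКВА' (Cyrillic). The lame answers to his previous question worked for Latin but not for Cyrillic.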
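
To make the Latin versus Cyrillic point concrete, here is a small sketch, assuming Python 3 (where ascii() plays the role that repr() plays for unicode objects in Python 2). The variable names are illustrative only.

```python
import unicodedata

latin = 'MOCKBA'                                     # Latin letters
cyrillic = '\u041c\u041e\u0421\u041a\u0412\u0410'   # Cyrillic look-alike

# The two strings look alike on screen but are different code points
first_names = (unicodedata.name(latin[0]), unicodedata.name(cyrillic[0]))

# The escaped form removes the visual ambiguity
escaped = ascii(cyrillic)

print(latin == cyrillic, first_names, escaped)
```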