【Question title】: Error when reading UTF-8 characters with Python
【Posted】: 2017-02-17 16:07:12
【Question】:

I have the following function in Python. It takes a string as an argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):

def filt(word):
    dic = { u'á':'a',u'ã':'a',u'â':'a' } # the whole dictionary is too big, it is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new

It is supposed to "filter" every string in a list that I read from a file:

lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        lines.append(line.strip())

lines = [filt(l) for l in lines]

But I get this:

filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert 
  both arguments to Unicode - interpreting them as being unequal 
  new = new + dic.get(l, l)

The filtered strings contain sequences such as '\xc3\xb4' instead of ASCII characters. What should I do?

【Comments on the question】:

Which version of Python? There are major differences in how different versions handle UTF-8.
2.7.12 (the Ubuntu version)

Tags: python python-2.7 utf-8 character-encoding

【Solution 1】:

You are mixing and matching Unicode strings and regular (byte) strings.
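
To make the mismatch concrete, here is a minimal sketch (written so that it runs on Python 3 as well as 2.7; the two-entry dictionary is a made-up stand-in for the real one). It shows that a byte-mode read hands the function UTF-8 bytes, which can never match the Unicode keys:

```python
# -*- coding: utf-8 -*-
word = u"alçapão"                 # Unicode text: 7 characters
raw = word.encode("utf-8")        # what a byte-mode read returns: 9 bytes

# Each accented letter occupies two bytes in UTF-8.
c_cedilla = u"ç".encode("utf-8")  # b'\xc3\xa7'

# Hypothetical two-entry sample of the replacement dictionary.
dic = {u"ç": "c", u"ã": "a"}

# Iterating over text yields whole characters, so lookups succeed.
hits_text = [ch for ch in word if ch in dic]

# Iterating over bytes yields single bytes, which never equal a key.
hits_bytes = [b for b in raw if b in dic]
```

On Python 2.7 the last lookup is the kind of comparison that can raise the UnicodeWarning from the question; either way `hits_bytes` stays empty, which is why the bytes pass through unfiltered.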

Use the io module to open your text file so that it is decoded to Unicode as it is read:

with io.open("to-filter.txt","r", encoding="utf-8") as f:

This assumes that your to-filter.txt file is UTF-8 encoded.

You can also read the whole file into a list in one step:

import io

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines is now a list of Unicode strings.
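
A quick way to check this without touching the real file is the sketch below. It writes a temporary file with made-up sample words (an assumption, standing in for to-filter.txt), reads it back with io.open, and confirms that each entry is decoded text with one element per character:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample data standing in for to-filter.txt.
sample = u"alçapão\nvovô\n"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Same read pattern as in the answer: decode while reading.
    with io.open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
finally:
    os.remove(path)

# Lengths count characters, not bytes, because the text was decoded.
lengths = [len(line) for line in lines]
```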

Optional

It looks like you are trying to convert non-ASCII characters to their closest ASCII equivalents. The easiest way is:

import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

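What this does:

1. Decomposes each character into its component parts. For example, ã can be represented as a single Unicode character (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
2. Encodes the component parts to ASCII. Non-ASCII parts (those with code points above U+007F) are ignored.
3. Decodes the result back to a Unicode str for convenience.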
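
The three steps can be traced one at a time, as in the sketch below (the intermediate variable names are only for illustration). It also shows a limitation that is easy to miss: a letter with no Unicode decomposition, such as ø, has no ASCII part to keep, so it disappears entirely.

```python
# -*- coding: utf-8 -*-
import unicodedata

word = u"ã"

# Step 1: decompose into base letter + combining mark.
decomposed = unicodedata.normalize("NFKD", word)    # u'a\u0303'

# Step 2: encode to ASCII, silently dropping the combining mark.
ascii_bytes = decomposed.encode("ascii", "ignore")  # b'a'

# Step 3: decode back to a text string.
result = ascii_bytes.decode("ascii")                # u'a'


def filt(word):
    """Same one-liner as in the answer, shown for the examples below."""
    return unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode("ascii")


converted = filt(u"alçapão")   # u'alcapao'
# Caveat: u"ø" has no decomposition, so the whole letter is dropped.
lossy = filt(u"søren")         # u'sren'
```

If dropping such letters is not acceptable, a lookup table like the one in the question can cover them before or after normalization.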

TL;DR

Your code now becomes:

import io
import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines = [filt(l) for l in lines]

Python 3.x

Although it is not strictly necessary, you can drop the io. prefix and call the built-in open() directly.

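【Comments】:

【Solution 2】:

The root of the problem is that you are not reading Unicode strings from the file; you are reading byte strings. There are three ways to fix this. The first is to open the file with the io module, as the other answer suggests. The second is to decode each string as you read it: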

lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        # Python 2: line is a byte string, so decode it before use
        lines.append(line.decode('utf-8').strip())
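
The decode-per-line approach can be tried without a file, as in this sketch, which uses an in-memory io.BytesIO object with made-up contents as a stand-in for a file opened in binary mode. Note that on Python 3 a file opened in text mode already yields text, which has no decode method, so this pattern only applies there when the file is opened with "rb":

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical stand-in for a file opened in binary mode.
fake_file = io.BytesIO(u"alçapão\n  vovô  \n".encode("utf-8"))

lines = []
for raw_line in fake_file:
    # Decode first, then strip: the result is text, not bytes.
    lines.append(raw_line.decode("utf-8").strip())
```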

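The third way is to use Python 3, which always reads text files into Unicode strings.

Finally, there is no need to write your own code to convert accented characters to plain ASCII; the unidecode package does exactly that.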

from unidecode import unidecode
print(unidecode(line))

【Comments】:

The unidecode module converts the string to '\u0646\u0638\u0627\u0631\u0631\u062a', for example ' آمریکا قصد براندازی آن و اختلاف بین مسلمین را دارد'.