Answers to: Python Unicode issues with .txt file

【Question title】: Python Unicode issues with .txt file
【Posted】: 2016-06-15 14:44:43
【Question】:

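Long story short, I am writing a Python script that asks the user to drag and drop a .docx file, which is then converted to .txt. Python looks for keywords in the .txt file and displays them in the shell. I was running into a UnicodeDecodeError ('charmap' codec, etc.). I got past that by writing word.decode("charmap") inside my for loop. Now Python does not display the keywords it finds. Any advice on how to overcome this? Maybe Python could skip the characters it cannot decode and keep reading the rest? Here is my code: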

import sys
import os
import codecs

filename = input("Drag and drop resume here: ")
keywords =['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []

with open(filename, "rb") as file:
        for line in file:
            for word in line.split():
                word.decode("charmap")
                file_words.append(word)

comparison = []

for words in file_words:
    if words in keywords:
        comparison.append(words)

def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output

comparison = remove_duplicates(comparison)
print ("Keywords found:",comparison)

key_count = 0
word_count = 0

for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1

Threshold = word_count / key_count

if Threshold <= 0.7:
    print ("The candidate is not qualified for")
else:
    print ("The candidate is qualified for")

file.close()

And the output:

Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for

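【Question comments】:

Try this: word.decode('utf-8',errors='ignore')
Thanks, but still nothing is printed under keywords. I have read the file myself, and it definitely contains keywords that my program should recognize. It works for several other files I have scanned. Maybe the undecodable characters interrupt the reading process?
Why are you passing "charmap" to decode? Can you provide a small text sample that reproduces the problem?
Here is a sample: ðïà±á>Þ图>þ图
How do I tell Python to filter out these characters?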

Tags: python python-3.x unicode utf-8

【Solution 1】:

In Python 3, don't open text files in binary mode. By default, the file is decoded to Unicode using locale.getpreferredencoding(False) (cp1252 on US Windows):

with open(filename) as file:
    for line in file:
        for word in line.split():
            file_words.append(word)

Or specify the encoding:

with open(filename, encoding='utf8') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)

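You do need to know the encoding of the file. open has other options as well, including errors='ignore' and errors='replace', but if you know the correct encoding you should not get errors in the first place.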
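As a minimal sketch of those error handlers, the same invalid input can be decoded three ways. The byte string below is made up for illustration and is not taken from the asker's file. The example uses bytes.decode so that it is self-contained; open(filename, encoding='utf8', errors='replace') accepts the same handler names.

```python
# Hypothetical sample: ASCII text plus one byte (0xFF) that is never valid in UTF-8.
raw = b"DoD \xff Enterprise"

# errors='strict' is the default and raises on the bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# errors='replace' substitutes U+FFFD (the replacement character) for it.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed)
print(ignored.split())
print(replaced.count("\ufffd"))
```

Both handlers hide the symptom rather than fix the cause, so they are most useful once the encoding is known to be mostly right.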

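As others have said, posting a sample of the text file that reproduces the error, together with the error traceback, would help diagnose your specific problem.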
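Putting the text-mode approach together with the keyword check from the question, a rough sketch might look like the following. The function names find_keywords and is_qualified and the utf-8 default are assumptions made for illustration; the 0.7 threshold comes from the question's code.

```python
def find_keywords(path, keywords, encoding="utf-8"):
    """Return the keywords that occur as whole words in the text file at `path`."""
    # Text mode: each line arrives as str, so no decode() call is needed.
    with open(path, encoding=encoding) as handle:
        file_words = {word for line in handle for word in line.split()}
    # Keep the order of `keywords`; a set lookup also removes duplicates.
    return [kw for kw in keywords if kw in file_words]


def is_qualified(found, keywords, threshold=0.7):
    """True when the fraction of keywords found is above `threshold`."""
    return len(found) / len(keywords) > threshold
```

With the question's variables this would be called as found = find_keywords(filename, keywords). Unlike the original loop, the result follows the order of the keyword list rather than the order of appearance in the file.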

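【Comments】:

When I use "charmap" and "ISO-8859-1", they don't give any errors, but Python stops reading the file. Using "utf-8" gives me an error. Here is the sample I provided in one of the earlier comments: ðïà±á>þ图
@E_R, that is not an encoded text file. Why do you think you can decode it? Some code page decodings (such as ISO-8859-1) map all 256 possible byte values to 256 Unicode characters. The conversion never fails, but if the file was not encoded as ISO-8859-1 to begin with, you get garbage.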
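The point about ISO-8859-1 can be checked directly. This is a small sketch; the four-byte binary_header value is a hypothetical stand-in for non-text data and is not taken from the asker's file.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 maps byte N to code point N, so decoding can never fail.
text = all_bytes.decode("iso-8859-1")
assert len(text) == 256
assert [ord(ch) for ch in text] == list(range(256))

# Success says nothing about whether the result is meaningful.
binary_header = b"\xd0\xcf\x11\xe0"           # hypothetical non-text bytes
garbage = binary_header.decode("iso-8859-1")  # decodes, but is not real text

# A stricter codec such as UTF-8 rejects the same bytes.
try:
    binary_header.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)
```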

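【Solution 2】:

Posting the code that produces the traceback would probably make this easier to fix.

I am not sure this is the only problem, but this may work better: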

with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            file_words.append(word.decode("charmap"))
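One likely reason the question's loop printed an empty list: in Python 3, decode() returns a new str and leaves the bytes object unchanged, and a bytes value never compares equal to a str. A small sketch with a made-up input line:

```python
keywords = ["NGA", "DoD"]
line = b"DoD contractor"   # what iterating over a file opened with "rb" yields

# Discarding the result of decode(), as in the question, keeps the words as bytes.
undecoded = []
for word in line.split():
    word.decode("charmap")     # return value thrown away
    undecoded.append(word)

# Appending the decoded value stores str objects instead.
decoded = [word.decode("charmap") for word in line.split()]

matches_before = [w for w in undecoded if w in keywords]
matches_after = [w for w in decoded if w in keywords]

print(matches_before)   # []
print(matches_after)    # ['DoD']
```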

【Comments】:

Traceback (most recent call last): File "C:\Users\User\Desktop\ResumeScan.py", line 12, in file_words.append(word.decode("charmap")) AttributeError: 'str' object has no attribute 'decode'
I don't think Python 3.5.1 accepts decode() the way Python 2.7 does
words.decode(encoding='UTF-8', errors='strict') # assuming UTF-8 NFC input
Ah, right, there is no decode() in Python 3 ... a try: except: might suit your decoding problem
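For context, decode() does exist in Python 3, but only on bytes. A str read from a file opened in text mode is already decoded, which is what the AttributeError above reports. The try/except idea from the last comment could look roughly like this; the helper name decode_word and the cp1252 fallback are assumptions for illustration, not something the commenters specified.

```python
def decode_word(raw, primary="utf-8", fallback="cp1252"):
    """Decode bytes with `primary`, falling back to `fallback` if that fails."""
    if isinstance(raw, str):
        # Already text (file opened in text mode); nothing to decode.
        return raw
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # errors='replace' guarantees the fallback itself cannot raise.
        return raw.decode(fallback, errors="replace")
```

A fallback like this only guesses at the encoding, so it can still produce wrong characters for files that are neither UTF-8 nor cp1252.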

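【Solution 3】:

OK, I figured it out. Here is my code. However, I then tried a .docx file that seemed more complex, and when it was converted to .txt the whole file consisted of special characters. So now I think I should move to the python-docx module, since it handles XML-based files such as Word documents. I added "encoding = 'charmap'":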

with open(filename, encoding = 'charmap') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
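On the idea of reading the .docx directly: a .docx file is a ZIP archive whose main text is stored in word/document.xml, which may be why treating it as plain text yields unreadable characters. The sketch below swaps in the standard library (zipfile and xml.etree.ElementTree) instead of python-docx, purely so that it runs without extra installs. The function name docx_words is hypothetical, and the sketch reads only the main document body, not headers, footers, or footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_words(path):
    """Return the whitespace-separated words from the main body of a .docx file."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    words = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        words.extend(text.split())
    return words
```

The returned list can be used in place of file_words in the loop above. python-docx exposes the same paragraph text at a higher level and is likely the better choice for anything beyond a quick keyword scan.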
