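[Posted]: 2016-06-15 14:44:43
[Question]:
Long story short, I am writing a Python script that asks the user to drag and drop a .docx file, which is then converted to .txt. Python looks for keywords in the .txt file and prints them to the shell. I was getting a UnicodeDecodeError ('charmap' codec, etc.), which I worked around by writing word.decode("charmap") inside my for loop. Now Python no longer prints the keywords it finds. Any advice on how to get past this? Could Python skip the characters it cannot decode and keep reading the rest? Here is my code: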
import sys
import os
import codecs
filename = input("Drag and drop resume here: ")
keywords =['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            word.decode("charmap")
            file_words.append(word)
comparison = []
for words in file_words:
    if words in keywords:
        comparison.append(words)
def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output
comparison = remove_duplicates(comparison)
print ("Keywords found:",comparison)
key_count = 0
word_count = 0
for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1
Threshold = word_count / key_count
if Threshold <= 0.7:
    print ("The candidate is not qualified for")
else:
    print ("The candidate is qualified for")
file.close()
And the output:
Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for
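One likely reason for the empty list, independent of any odd characters: the file is opened in "rb" mode, so every word is a bytes object, and word.decode("charmap") returns a new str that the loop discards. In Python 3 a bytes value never compares equal to a str, so the membership test against keywords always fails. The sketch below illustrates this; find_keywords and sample are hypothetical names introduced for illustration, and UTF-8 is an assumed encoding, not something the question confirms.

```python
KEYWORDS = ['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']

def find_keywords(raw, keywords=KEYWORDS, encoding="utf-8"):
    """Return the keywords present in raw bytes, in order of first appearance."""
    found = []
    for token in raw.split():
        # decode() returns a new str; it does not modify token in place,
        # so the result has to be kept and used for the comparison.
        word = token.decode(encoding, errors="ignore")
        if word in keywords and word not in found:
            found.append(word)
    return found

# Hypothetical sample text standing in for the resume contents.
sample = b"Worked with NGA and DoD on Enterprise systems; NGA again."

print(b"NGA" in KEYWORDS)     # False: bytes never equal str in Python 3
print(find_keywords(sample))  # ['NGA', 'DoD', 'Enterprise']
```

Because the function skips words it has already recorded, a separate de-duplication pass is not needed in this version.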
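[Comments]:
- Try this: word.decode('utf-8', errors='ignore')
- Thanks, but there is still nothing printed under the keywords. I have read through the file myself, and it definitely contains keywords my program should recognize. It works on several other files I have scanned. Maybe the undecodable characters interrupt the reading process?
- Why are you passing "charmap" to decode? Can you provide a small text sample that reproduces the problem?
- Here is a sample: ðïà±á>Þ图>þ图:
- How do I tell Python to filter out these characters?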
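A minimal sketch of one way to filter such characters, assuming the file is mostly UTF-8 text with some stray bytes mixed in: open it in text mode with errors="ignore" so undecodable bytes are dropped, then replace any remaining non-printable characters with spaces. read_words and the temporary demo file are hypothetical, added only for illustration. If the file is really a binary Word document saved with a .txt extension, which the posted sample may hint at but does not prove, this would strip the noise without recovering the actual text.

```python
import os
import tempfile

def read_words(path, encoding="utf-8"):
    """Read a text file and return its words, skipping undecodable bytes."""
    # errors="ignore" silently drops bytes that are invalid in `encoding`
    # instead of raising UnicodeDecodeError.
    with open(path, encoding=encoding, errors="ignore") as handle:
        text = handle.read()
    # Replace leftover control characters (for example NUL) with spaces,
    # keeping ordinary whitespace so that split() still separates words.
    cleaned = "".join(
        ch if ch.isprintable() or ch.isspace() else " " for ch in text
    )
    return cleaned.split()

# Demo on a throwaway file containing one invalid byte and one NUL character.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"Running\xff meets \x00NGA\n")
demo_path = tmp.name
words = read_words(demo_path)
os.remove(demo_path)

print(words)  # ['Running', 'meets', 'NGA']
```

The returned words are already str objects, so they can be compared directly with the entries in keywords.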
Tags: python python-3.x unicode utf-8