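【Posted】: 2015-01-28 22:47:35
【Question】:
The goal of this code is to find how frequently each word is used in a book.
I'm trying to read the text of a book, but the following line keeps breaking my code:
precious protégés. No, gentlemen; he'll always show 'em a clean pair
Specifically, the é character.
I've looked at the following documentation, but I don't really understand it: https://docs.python.org/3.4/howto/unicode.html
Here is my code: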
import string

# Create word dictionary from the comprehensive word list
word_dict = {}
def create_word_dict ():
    # open words.txt and populate dictionary
    word_file = open ("./words.txt", "r")
    for line in word_file:
        line = line.strip()
        word_dict[line] = 1

# Removes punctuation marks from a string
def parseString (st):
    st = st.encode("ascii", "replace")
    new_line = ""
    st = st.strip()
    for ch in st:
        ch = str(ch)
        if (n for n in (1,2,3,4,5,6,7,8,9,0)) in ch or ' ' in ch or ch.isspace() or ch == u'\xe9':
            print (ch)
            new_line += ch
        else:
            new_line += ""
    # now remove all instances of 's or ' at end of line
    new_line = new_line.strip()
    print (new_line)
    if (new_line[-1] == "'"):
        new_line = new_line[:-1]
    new_line.replace("'s", "")
    # Conversion from ASCII codes back to useable text
    message = new_line
    decodedMessage = ""
    for item in message.split():
        decodedMessage += chr(int(item))
    print (decodedMessage)
    return new_line

# Returns a dictionary of words and their frequencies
def getWordFreq (file):
    # Open file for reading the book.txt
    book = open (file, "r")
    # create an empty set for all Capitalized words
    cap_words = set()
    # create a dictionary for words
    book_dict = {}
    total_words = 0
    # remove all punctuation marks other than '[not s]
    for line in book:
        line = line.strip()
        if (len(line) > 0):
            line = parseString (line)
            word_list = line.split()
            # add words to the book dictionary
            for word in word_list:
                total_words += 1
                if (word in book_dict):
                    book_dict[word] = book_dict[word] + 1
                else:
                    book_dict[word] = 1
    print (book_dict)
    # close the file
    book.close()

def main():
    wordFreq1 = getWordFreq ("./Tale.txt")
    print (wordFreq1)

main()
The error I get is as follows:
Traceback (most recent call last):
  File "Books.py", line 80, in <module>
    main()
  File "Books.py", line 77, in main
    wordFreq1 = getWordFreq ("./Tale.txt")
  File "Books.py", line 60, in getWordFreq
    line = parseString (line)
  File "Books.py", line 36, in parseString
    decodedMessage += chr(int(item))
OverflowError: Python int too large to convert to C long
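For readers following along, here is a minimal sketch of what most likely triggers this error, assuming Python 3. There, `str.encode` returns a `bytes` object, and iterating over `bytes` yields integers, so `str(ch)` produces strings like "112" instead of single characters. Concatenating them gives one very long run of digits, and `chr(int(...))` cannot convert a number that large. The helper name `digits_from_bytes` is hypothetical and only illustrates the mechanism; it is not taken from the posted code.

```python
# Iterating over bytes in Python 3 yields integers, not 1-character strings.
def digits_from_bytes(text):
    # Non-ASCII characters such as 'é' are replaced by b"?" (byte value 63).
    encoded = text.encode("ascii", "replace")
    # str() on each integer byte value gives its decimal digits.
    return "".join(str(b) for b in encoded)

digits = digits_from_bytes("protégés")
print(digits)

try:
    chr(int(digits))
except (OverflowError, ValueError) as exc:
    # Which exception is raised depends on how large the number is.
    print(type(exc).__name__, exc)
```

If this is indeed the cause, the fix is probably to drop the `encode` step and the `chr(int(...))` loop and work on the `str` directly.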
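【Comments】:
-
You've probably checked this already, but if not, check the encoding of the incoming data. Is it UTF-8, ISO-8859-1, WIN-1252, or UCS-2? There's nothing quite like expecting UTF-8 and running into a character with the high bit set that turns out to be a plain 8-bit character rather than UTF-8.
-
I don't know how to check the encoding in Notepad. I'm also using iPython, though. How can I find out?
-
On Windows, NotePad++ can give you a clue. On Linux, the "file" command will tell you. Or use a hex viewer or binary editor to look at the actual bytes in the passage.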
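The bytes can also be inspected from Python itself. The sketch below is one possible approach, not a definitive detector: it tries a few candidate encodings and reports which ones decode without error. The function name `try_decodings` and the list of candidates are assumptions chosen for illustration, and the sample bytes stand in for what would normally be read from the file in binary mode.

```python
def try_decodings(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return {encoding: decoded text, or None if decoding fails}."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# In practice the bytes would come from the book file, for example:
#     with open("./Tale.txt", "rb") as f:
#         raw = f.read()
# Here a short in-memory sample is used instead (an assumption).
sample = "protégés".encode("cp1252")
print(sample)                 # raw bytes, similar to what a hex viewer shows
print(try_decodings(sample))
```

A successful decode does not prove the encoding is right (latin-1 accepts any byte sequence), but a failed UTF-8 decode is a strong hint that the file is in an 8-bit encoding.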
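-
In Notepad, the encoding is set to ANSI.
-
When you remove punctuation, do you mean to get rid of the é character entirely, or to turn it into a plain e?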
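Both options raised in the last comment can be sketched briefly. This assumes the lines have already been decoded to `str`, for instance by opening the file with an explicit `encoding` argument. On Western-European Windows, Notepad's "ANSI" usually means Windows-1252, though that is an assumption about this particular file. The names `strip_accents` and `word_freq` are hypothetical, and `collections.Counter` is swapped in for the hand-built dictionary.

```python
import unicodedata
from collections import Counter

def strip_accents(text):
    """Turn 'é' into 'e' by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_freq(lines, fold_accents=False):
    """Count words, keeping letters, digits, apostrophes, and whitespace."""
    counts = Counter()
    for line in lines:
        if fold_accents:
            line = strip_accents(line)
        # str.isalnum() is Unicode-aware, so 'é' counts as a letter.
        cleaned = "".join(
            ch for ch in line if ch.isalnum() or ch.isspace() or ch == "'"
        )
        counts.update(cleaned.split())
    return counts

# In practice: lines = open("./Tale.txt", encoding="cp1252")  (assumed encoding)
lines = ["precious protégés. No, gentlemen;"]
print(word_freq(lines))                     # keeps 'protégés'
print(word_freq(lines, fold_accents=True))  # counts 'proteges' instead
```

Keeping the accent preserves the word as written; folding it merges spellings such as "protégés" and "proteges" into a single count.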
Tags: python file python-3.x unicode