[Posted]: 2020-10-04 21:37:41
[Problem description]:
I have a French text file and I want to count its most frequent words, ignoring stop words. The code is below:
with open('./text_file.txt', 'r', encoding='utf8') as f:
    s = f.read()

num_chars = len(s)
num_lines = s.count('\n')

# call split with no arguments
words = s.split()

d = {}
for w in words:
    if w in d:
        d[w] += 1
    else:
        d[w] = 1

num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
lst.sort()
lst.reverse()

# nltk treatment
from nltk.corpus import stopwords  # Import the stop word list
from nltk.tokenize import wordpunct_tokenize
stop_words = set(stopwords.words('french'))  # creating a set makes the searching faster
print(stop_words)
print([word for word in lst if word not in stop_words])

print('\n The 50 most frequent words are \n')
i = 1
for count, word in lst[:50]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
This returns the most frequent words, but the stop words are still included. Is there a better way?
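One likely reason the stop words survive the filter: `lst` holds `(count, word)` tuples, so the list comprehension tests a whole tuple for membership in `stop_words` and never matches anything. A minimal sketch of the unpacked comparison (the stop word set and data here are toy stand-ins, not NLTK's French list):

```python
stop_words = {'le', 'et'}                  # stand-in for the NLTK set
lst = [(3, 'le'), (2, 'chat'), (1, 'et')]  # (count, word) pairs, as in the question
# Unpack each tuple so the membership test sees the word, not the pair
filtered = [(c, w) for c, w in lst if w not in stop_words]
print(filtered)  # [(2, 'chat')]
```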
[Discussion]:
-
You could load stop_words up front and check against it in the `if w in d` loop. That way you don't have to count the stop words first and then remove them afterwards.
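The commenter's suggestion, filtering stop words while counting rather than afterwards, can be sketched as follows. The hard-coded stop word set is a small stand-in for `stopwords.words('french')`, and `collections.Counter` is used in place of the question's manual dictionary:

```python
from collections import Counter

# Stand-in for set(stopwords.words('french')) so the sketch is self-contained
stop_words = {'le', 'la', 'et', 'de'}

text = "le chat et le chien et le chat"
# Stop words are skipped before they ever enter the counter
counts = Counter(w for w in text.lower().split() if w not in stop_words)
print(counts.most_common(2))  # [('chat', 2), ('chien', 1)]
```

`Counter.most_common(n)` also replaces the manual sort-and-reverse step from the question.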