【Question Title】: Most frequently occurring words of a text file, excluding stopwords
【Posted】: 2020-10-04 21:37:41
【Question】:

I have a French text file and I want to count its most frequently occurring words, ignoring stopwords. Here is the code:

with open('./text_file.txt', 'r', encoding='utf8') as f:
    s = f.read()

num_chars = len(s)
num_lines = s.count('\n')

#call split with no arguments
words = s.split()
d = {}
for w in words:
    if w in d:
        d[w] += 1
    else:
        d[w] = 1

num_words = sum(d[w] for w in d)

lst = [(d[w],w) for w in d]
lst.sort()
lst.reverse()

# nltk treatment
from nltk.corpus import stopwords # Import the stop word list
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('french')) # creating a set makes the searching faster
print (stop_words)
print ([word for word in lst if word not in stop_words])


print('\n The 50 most frequent words are /n')

i = 1
for count, word in lst[:50]:
    print('%2s. %4s %s' %(i,count,word))
    i+= 1

This returns the most frequent words, stopwords included. Is there a better way to do it?
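(A note on why the filter above has no effect: `lst` holds `(count, word)` tuples, so `word not in stop_words` tests tuples against the set and never matches; and since `s.split()` keeps case and trailing punctuation, tokens like `De` or `de,` would not match the lowercase stopword list anyway. A minimal sketch of the working order of operations, with a hardcoded set standing in for `stopwords.words('french')`:)

```python
from collections import Counter

# Hardcoded stand-in for stopwords.words('french') (assumption for illustration).
stop_words = {"de", "la", "le", "et", "les", "pour"}

text = "De la musique avant toute chose, et de la musique encore"
# Lowercase and strip punctuation BEFORE the stopword test, then count.
tokens = [w.strip(".,;:!?").lower() for w in text.split()]
counts = Counter(w for w in tokens if w and w not in stop_words)
print(counts.most_common(2))  # [('musique', 2), ('avant', 1)]
```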

【Question Discussion】:

  • You could load stop_words up front and check them in the if w in d loop; that way you don't count them first only to remove them afterwards.

Tags: python nltk


【Solution 1】:

Here is a simplified version:

from nltk.corpus import stopwords # Import the stop word list
from nltk.tokenize import wordpunct_tokenize

with open('./text_file.txt', 'r', encoding='utf8') as f:
    words = f.read().split()

d = {}
stop_words = set(stopwords.words('french')) # creating a set makes the searching faster
for w in words:
    if w not in stop_words:
        if w in d:
            d[w] += 1
        else:
            d[w] = 1

lst = sorted([(d[w], w) for w in d], reverse=True)

print('\nThe 50 most frequent words are:\n')

i = 1
for count, word in lst[:50]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1

【Discussion】:

  • Hi, thanks for your time. I still have the same problem: "de" (French for "of/the") comes out as the most frequent word. I want to remove those "generic" words; that's why I'm using nltk.
  • @user93804 I've added that.
  • Hello, it still doesn't work. It gives me the following error: TypeError: argument of type 'WordListCorpusReader' is not iterable
  • My bad, forgot the underscore.
【Solution 2】:
with open("/yourFile.txt", "r") as file:
    words = file.read().split()

    cptwords = {}

    for word in words:
        # rstrip returns a new string (str is immutable), so reassign it,
        # and pass the punctuation characters to strip explicitly
        word = word.rstrip(",.:!?;")

        cptwords.setdefault(word, 0)
        cptwords[word] += 1

    cptwords = sorted(cptwords.items(), key = lambda x: x[1], reverse = True)

    print(f"The first 50 most used words are {[truc[0] for truc in cptwords[:50]]}")

This is a simple approach.
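A pitfall worth flagging with `rstrip`: Python strings are immutable, so a bare `word.rstrip()` discards its result, and with no arguments it strips whitespace rather than punctuation. The stripped value has to be reassigned:

```python
word = "chose,"
word.rstrip(",")         # returns "chose", but the result is thrown away
print(word)              # still "chose,"

word = word.rstrip(",")  # reassign to keep the stripped value
print(word)              # "chose"
```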

【Discussion】:

  • Hi, thanks for your time. I still have the same problem: "de" (French for "of/the") comes out as the most frequent word. I want to remove those "generic" words; that's why I'm using nltk.
【Solution 3】:

Here is a cleaner (and probably faster) solution using collections.Counter:

from collections import Counter
from nltk.corpus import stopwords # Import the stop word list
from nltk.tokenize import wordpunct_tokenize
NUM_WORDS = 50

with open('./text_file.txt', 'r', encoding='utf8') as f:
    words = f.read().split()

stop_words = set(stopwords.words('french'))  # build the set once, not per word
word_counts = Counter(word for word in words if word not in stop_words)
print(f'\nThe {NUM_WORDS} most frequent words are:\n')
for i, (word, count) in enumerate(word_counts.most_common(NUM_WORDS), start=1):
    print('%2s. %4s %s' % (i, count, word))

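A likely culprit for the long runtime reported below: if `set(stopwords.words('french'))` sits inside the generator's `if` clause, the stopword list is re-read and the set rebuilt once per word. Building the set once before the loop makes the pass linear. A rough sketch of the difference, with a plain list standing in for the corpus reader:

```python
import timeit

stop = ["le", "la", "de", "et", "les"] * 40   # stand-in for stopwords.words('french')
words = ["musique"] * 10_000

# Slow pattern: the set is rebuilt for every word tested.
slow = timeit.timeit(lambda: [w for w in words if w not in set(stop)], number=1)

# Fast pattern: build the set once, then do O(1) membership checks.
stop_set = set(stop)
fast = timeit.timeit(lambda: [w for w in words if w not in stop_set], number=1)

print(f"rebuild per word: {slow:.4f}s  build once: {fast:.4f}s")
```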
【Discussion】:

  • Thanks for your time. 45 minutes in and the code is still running... is that normal?
【Solution 4】:

NLTK has a class for counting frequencies called FreqDist, which provides many convenient methods. You can use it as follows:

from nltk.tokenize import wordpunct_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords


with open('text_file.txt', 'r', encoding='utf8') as f:
    text = f.read()

stop_words = set(stopwords.words('french'))  # build once instead of per token
fd = FreqDist(
    word
    for word in wordpunct_tokenize(text)
    if word not in stop_words
)
fd.pprint()
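`FreqDist` subclasses `collections.Counter`, so the familiar `Counter` API works alongside NLTK extras such as `fd.N()` (total sample count) and `fd.freq(word)` (relative frequency). A quick sketch using `Counter` directly, so it runs without the NLTK data files:

```python
from collections import Counter

# Counter mimics what a FreqDist would hold for this toy text.
fd = Counter("de la musique avant toute chose et de la musique encore".split())

print(fd.most_common(3))      # top (word, count) pairs
total = sum(fd.values())      # fd.N() on a real FreqDist
print(fd["musique"] / total)  # fd.freq('musique') on a real FreqDist
```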

【Discussion】:
