Count the number of occurrences of each word in a text - Python answers

【Question title】: Count the number of occurrences of each word in a text - Python
【Posted】: 2018-08-25 05:55:28
【Question】:

I know I can find a word in a text/array with:

if word in text:
    print('success')

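What I want to do is read each word in the text and keep counting how many times that word has been found (a simple counter task). The problem is that I don't really know how to keep track of the words that have already been read. In the end: how do I count the number of occurrences of each word?

I thought about saving them in an array (or even a multidimensional array, storing each word together with its count, or in two arrays) and adding 1 each time a word already in that array appears.

So, when I read a word, can I check it with something like this: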

if word not in wordsInText:
    print('success')

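【Question comments】:

Let me get this straight: you want to count the number of occurrences of each word?
Correct. Sorry if it wasn't explained well @AlexHristov
You can use a dictionary, with the word as the key and the count as the value.
I believe we need more information; for example, do you handle punctuation when parsing the text?
Here... a few days ago I got the winning answer to a similar question. It might help. stackoverflow.com/questions/49222636/…

Tags: python text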

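【Solution 1】:

As I understand it, you want to keep track of the words you have already read so that you can detect when you encounter a new one. Is that right? The simplest solution is to use a set, because it automatically discards duplicates. For example: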

known_words = set()
for word in text:
    if word not in known_words:
        print('found new word:', word)
    known_words.add(word)

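On the other hand, if you need the exact number of occurrences of each word (in mathematics this is called a "histogram"), replace the set with a dictionary: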

histo = {}
for word in text:
    histo[word] = histo.get(word, 0) + 1
print(histo)

Note: both solutions assume that text is an iterable of words. As other comments point out, str.split() is not entirely safe for producing one.
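
To illustrate the caveat about str.split(), here is a minimal sketch that tokenizes with a regular expression before building the dictionary. It assumes a "word" is a run of ASCII letters, digits, or apostrophes and that case should be ignored; the names WORD_RE and tokenize are made up for this example and are not part of the original answer.

```python
import re

# Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize(text):
    """Return the lowercase words in text, ignoring punctuation."""
    return WORD_RE.findall(text.lower())

sample = "The cat sat. The cat, again!"

# Plain split() leaves punctuation attached to the words.
naive = sample.split()

# Same dictionary-based histogram as above, fed with cleaned tokens.
histo = {}
for word in tokenize(sample):
    histo[word] = histo.get(word, 0) + 1

print(naive)
print(histo)
```

With plain split(), "sat." and "cat," would be counted as words distinct from "sat" and "cat"; the regular expression avoids that at the cost of hard-coding what counts as a word.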

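【Comments】:

But sets don't count words, do they? How would I know which counters, stored somewhere else, belong to which words?
I edited my initial answer to add a solution that counts occurrences using a dictionary.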

【Solution 2】:

I would use one of the following approaches:

1) If the word contains no spaces but the text does, use

for piece in text.split(" "):
    ...

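Then your word should appear at most once in each piece and be counted correctly. This fails if, for example, you want to count "Baden" twice in "Baden-Baden".

2) Use the string method find to learn not only whether the word is present but also where. Count it, then continue searching from that position. text.find(word) returns a position or -1.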
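
The find-based approach in 2) can be sketched as follows. The helper name count_with_find and the sample sentence are made up for this example. Note that find matches substrings, so this also counts the word when it appears inside a longer word.

```python
def count_with_find(text, word):
    """Count non-overlapping occurrences of word in text using str.find."""
    if not word:
        raise ValueError("word must be non-empty")
    count = 0
    pos = text.find(word)
    while pos != -1:
        count += 1
        # Resume the search just past the match that was found.
        pos = text.find(word, pos + len(word))
    return count

text = "Baden-Baden is a town; Baden is also a surname."
print(count_with_find(text, "Baden"))   # counts substring matches
print(text.split(" ").count("Baden"))   # counts whole space-separated pieces only
```

For this sample the find loop reports 3 while the split-based count reports 1, which is the "Baden-Baden" difference described in 1). The built-in text.count(word) returns the same non-overlapping count as the loop; the explicit version is useful mainly when the positions are needed as well.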

【Comments】:

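【Solution 3】:

Now that we have established what you want to achieve, I can give you an answer. The first thing to do is turn the text into a list of words. Although the split method may look like a good solution, it distorts the counts whenever a word at the end of a sentence is followed by a period, a comma, or any other character. A good solution to this problem is NLTK. Assume your text is stored in a variable named text. The code you are looking for looks like this: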

from itertools import chain
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize

text = "This is an example text. Let us use two sentences, so that it is more logical."
wordlist = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
print(Counter(wordlist))
# Counter({'.': 2, 'is': 2, 'us': 1, 'more': 1, ',': 1, 'sentences': 1, 'so': 1, 'This': 1, 'an': 1, 'two': 1, 'it': 1, 'example': 1, 'text': 1, 'logical': 1, 'Let': 1, 'that': 1, 'use': 1})

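【Comments】:

【Solution 4】:

There are several options, but I suggest you do the following:

1) Replace the special characters in the text so that it is uniform.
2) Split the cleaned sentence.
3) Use collections.Counter.

The code looks like this: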

from collections import Counter

my_text = "Lorem ipsum; dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut. labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

special_characters = ',.;'
for char in special_characters:
    my_text = my_text.replace(char, ' ')

print(Counter(my_text.split()))

I believe the safer approach is the NLTK answer, but sometimes it feels great to understand what you are doing yourself.
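
As a variation on the same three steps, the hand-written list of special characters could be replaced by the standard library's string.punctuation. This is only a sketch: count_words is a made-up helper name, lowercasing is an added assumption, and because string.punctuation includes the apostrophe and the hyphen, words such as "don't" are split in two.

```python
import string
from collections import Counter

def count_words(text):
    """Count lowercase words after replacing ASCII punctuation with spaces."""
    # Translation table mapping every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return Counter(text.translate(table).lower().split())

counts = count_words("Lorem ipsum; dolor sit amet, lorem. Dolor!")
print(counts)
```

A single translate call replaces all punctuation in one pass instead of one replace call per character.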

【Comments】:

【Solution 5】:

sentence = 'a quick brown fox jumped a another fox'

words = sentence.split(' ')

Approach 1:

result = {i: words.count(i) for i in set(words)}

Approach 2:

result = {}
for word in words:
    result[word] = result.get(word, 0) + 1

Approach 3:

from collections import Counter
result = dict(Counter(words))
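
All three approaches give the same counts. Approach 1 rescans the whole list once per distinct word, while approaches 2 and 3 make a single pass. As a small sketch of why one might keep the Counter instead of converting it to a dict (the variable name counts is made up here):

```python
from collections import Counter

sentence = 'a quick brown fox jumped a another fox'
counts = Counter(sentence.split(' '))

# The two most frequent words with their counts.
print(counts.most_common(2))

# Lookups work like a dict, but a missing word yields 0 instead of KeyError.
print(counts['fox'])
print(counts['wolf'])
```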

【Comments】:

【Solution 6】:

There is no need to tokenize into sentences first. The answer from Alexander Ejbekov can be simplified to:

from collections import Counter
from nltk.tokenize import word_tokenize

text = "This is an example text. Let us use two sentences, so that it is more logical."
wordlist = word_tokenize(text) 
print(Counter(wordlist))
# Counter({'is': 2, '.': 2, 'This': 1, 'an': 1, 'example': 1, 'text': 1, 'Let': 1, 'us': 1, 'use': 1, 'two': 1, 'sentences': 1, ',': 1, 'so': 1, 'that': 1, 'it': 1, 'more': 1, 'logical': 1})

【Comments】: