Python nltk counting word and phrase frequency

【Question title】: Python nltk counting word and phrase frequency
【Posted】: 2016-11-19 21:22:26
【Question】:

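I am using NLTK and trying to count the word phrases in a given document, up to a certain length, along with the frequency of each phrase. I tokenize the string to get a list of tokens.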

from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import *


data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"]

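# ngrams() returns a generator of tuples, one per run of 2 consecutive tokens
bigrams = ngrams(data, 2)

# Count how many times each bigram occurs
bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1

# Print each (bigram, count) pair; the order may differ between Python versions
for item in bigrams_c.items():
    print(item)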

The code above produces output like this:

(('is', 'this'), 1)
(('test', 'this'), 2)
(('a', 'test'), 3)
(('this', 'is'), 4)
(('is', 'not'), 1)
(('real', 'not'), 2)
(('is', 'real'), 2)
(('not', 'a'), 3)

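This is partly what I'm looking for.

My question is: is there a more convenient way to do this for phrases up to, say, 4 or 5 words long, without duplicating this code just to change the count variable?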

【Comments】:

Tags: python nltk word-frequency

【Solution 1】:

Since you tagged this nltk, here is how to do it using nltk's own methods, which offer more features than those in the standard Python collections.

from nltk import ngrams, FreqDist
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(data, size))

Each element of the dictionary all_counts is itself a dictionary of ngram frequencies. For example, you can get the five most common trigrams like this:

all_counts[3].most_common(5)

【Comments】:

Wow, this is so much better than what I had written. Thanks a lot, great answer!

【Solution 2】:

Yes: instead of running this loop, use collections.Counter(bigrams) or pandas.Series(bigrams).value_counts() to compute the counts in a single line.
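The Counter suggestion can be extended to several phrase lengths at once. The following is a minimal sketch, not part of the original answer: the helper name ngram_counts and its sizes parameter are hypothetical, and to keep it free of dependencies it builds the n-grams by zipping staggered slices of the list rather than calling nltk.util.ngrams.

```python
from collections import Counter


def ngram_counts(tokens, sizes=(2, 3, 4, 5)):
    """Return {n: Counter of n-gram tuples} for each n in sizes.

    Hypothetical helper for illustration; assumes tokens is a list.
    """
    counts = {}
    for n in sizes:
        # Zipping n staggered slices yields every run of n consecutive tokens
        grams = zip(*(tokens[i:] for i in range(n)))
        counts[n] = Counter(grams)
    return counts


data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

all_counts = ngram_counts(data)
print(all_counts[2].most_common(1))  # [(('this', 'is'), 4)]
```

Because each value is a Counter, most_common() and direct lookups such as all_counts[3][("not", "a", "test")] work as they would on a plain dictionary of counts.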

【Comments】: