Collecting all n-grams (and their frequencies) from a document: answers

【Question title】: Collecting all n-grams (and their frequencies) from document
【Posted】: 2021-04-06 00:07:32
【Question description】:

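I want to collect all n-grams from a text and also count their frequencies. Both tasks can be solved in one or two Python files. Here is what I have so far. It should now work on a .txt file instead of a hard-coded sentence.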

from nltk import ngrams

sentence = 'Hello, this is an example'

n = 3
threegrams = ngrams(sentence.split(), n)

for grams in threegrams:
    print(grams)

【Question discussion】:

What is your question?
Does this previous SO post help?

Tags: python nltk

【Solution 1】:

I found a good answer here, which I can break down for you. A single file is enough to achieve your goal.

First, import these nltk modules:

import nltk
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
from nltk.tokenize import word_tokenize

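Collocations are multi-word expressions that commonly occur together, which is why the nltk.collocations module helps find their frequencies. The word_tokenize function is another way of doing what sentence.split does, using a ready-made tool from the nltk package. Unlike split, it also separates punctuation into tokens of its own, so ',' and '.' show up as separate items in the trigrams.
(If you get an error about these packages being missing, check this out.)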
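To make the difference between whitespace splitting and punctuation-aware tokenizing concrete, here is a minimal standard-library sketch. The regular expression is only a rough stand-in chosen for illustration; it is not the algorithm word_tokenize uses, and the variable names are hypothetical.

```python
import re

text = "Hello, this is an example"

# str.split() only cuts on whitespace, so the comma stays attached to "Hello".
whitespace_tokens = text.split()

# Rough stand-in for a punctuation-aware tokenizer (NOT NLTK's algorithm):
# match runs of word characters, or any single non-space, non-word character.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)   # ['Hello,', 'this', 'is', 'an', 'example']
print(punct_aware_tokens)  # ['Hello', ',', 'this', 'is', 'an', 'example']
```

Whether punctuation should count as a token depends on the task; if it should not, filtering such tokens out before building n-grams is one option.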

This is the sentence I used to see how my script handles trigrams:

sentence = "Hello, this is an example. This is an example of the trigram count. The trigram count is neat"

To read a txt file instead, replace that line with:

with open("file.txt", "r") as f:
    myFile = f.read()  # whole file as one string; the file is closed afterwards

Next, we tokenize the text and build a collocation finder over its trigrams:

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(sentence)) 
#for txt files: replace the term 'sentence' with 'myFile'

Finally, we print the trigrams and their frequencies:

for i in finder.score_ngrams(trigram_measures.raw_freq):
    print(i)

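raw_freq is a method of the TrigramAssocMeasures class; the class provides other measures that you can apply to the trigrams instead of frequency. The scores are relative frequencies rather than raw counts: in the output below, 0.0476... is 1/21 and 0.0952... is 2/21, where 21 is the number of tokens in the sentence.

This is my output: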

(('is', 'an', 'example'), 0.09523809523809523)
((',', 'this', 'is'), 0.047619047619047616)
(('.', 'The', 'trigram'), 0.047619047619047616)
(('.', 'This', 'is'), 0.047619047619047616)
(('Hello', ',', 'this'), 0.047619047619047616)
(('The', 'trigram', 'count'), 0.047619047619047616)
(('This', 'is', 'an'), 0.047619047619047616)
(('an', 'example', '.'), 0.047619047619047616)
(('an', 'example', 'of'), 0.047619047619047616)
(('count', '.', 'The'), 0.047619047619047616)
(('count', 'is', 'neat'), 0.047619047619047616)
(('example', '.', 'This'), 0.047619047619047616)
(('example', 'of', 'the'), 0.047619047619047616)
(('of', 'the', 'trigram'), 0.047619047619047616)
(('the', 'trigram', 'count'), 0.047619047619047616)
(('this', 'is', 'an'), 0.047619047619047616)
(('trigram', 'count', '.'), 0.047619047619047616)
(('trigram', 'count', 'is'), 0.047619047619047616)
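Since the question asks for n-grams in general and for counts, here is a minimal sketch using only the standard library. The helper name ngram_counts is hypothetical, and the token list is typed in by hand to match the tokens visible in the output above, so that nltk is not needed to run it. Dividing each count by the number of tokens gives values consistent with the scores printed above.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zipping n staggered views of the list yields each n-gram as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Tokens as they appear in the output above (punctuation is its own token).
tokens = ("Hello , this is an example . This is an example of the "
          "trigram count . The trigram count is neat").split()

counts = ngram_counts(tokens, 3)
for gram, count in counts.most_common(3):
    # count / len(tokens) matches the relative frequencies shown above
    print(gram, count, count / len(tokens))
```

Calling ngram_counts(tokens, n) in a loop over several values of n would collect bigrams, trigrams, and so on in one script; for a .txt file, the tokens would come from the file's contents instead of the hand-typed string.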

【Discussion】: