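【Posted】: 2018-01-28 17:26:05
【Question】:
I have a text document, and I am using regex and nltk to find the 5 most common words in it. I need to print the sentences that contain these words. How can I do that? I would also like to extend this to find the common words across multiple documents and return the sentences they appear in.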
import re
import nltk

# Read the whole file and lowercase it so that counting is case-insensitive
with open('test.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# All words made of 3 to 15 lowercase ASCII letters
words = re.findall(r'\b[a-z]{3,15}\b', text_string)
fdist = nltk.FreqDist(words)  # frequency distribution built from the word list
most_common = fdist.max()  # the single most frequent word
top_five = fdist.most_common(5)  # list of (word, count) pairs
list_5 = [word for word, freq in top_five]
print(top_five)
print(list_5)
Output:
[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)]
['you', 'tuples', 'the', 'are', 'pard']
The output lists the most frequent words. I need to print the sentences that contain these words. How can I do that?
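One possible approach, sketched under a few assumptions: split each document into sentences first, count words per sentence, and then look up which sentences contain each top word. The names `split_sentences`, `sentences_for_top_words`, and the sample `docs` dictionary are made up for illustration. The sketch uses `collections.Counter` and a naive regex sentence splitter in place of `nltk.FreqDist` and `nltk.sent_tokenize` so that it runs without NLTK data; the NLTK tokenizer would likely handle abbreviations and similar edge cases better.

```python
import re
from collections import Counter

# Same word pattern as in the question: 3 to 15 lowercase ASCII letters
WORD_RE = re.compile(r'\b[a-z]{3,15}\b')


def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' followed
    # by whitespace. nltk.sent_tokenize is a more robust alternative.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def sentences_for_top_words(documents, n=5):
    """Map each of the n most common words to (doc_name, sentence) pairs.

    `documents` is a dict of {document name: document text}; word counts
    are accumulated across all documents.
    """
    counts = Counter()
    sentences = []  # (doc_name, original sentence, set of words in it)
    for name, text in documents.items():
        for sentence in split_sentences(text):
            words = WORD_RE.findall(sentence.lower())
            counts.update(words)
            sentences.append((name, sentence, set(words)))

    result = {}
    for word, _freq in counts.most_common(n):
        result[word] = [(name, s) for name, s, words in sentences if word in words]
    return result


# Hypothetical sample input; in practice the texts would come from files
docs = {
    "a.txt": "Tuples are immutable. Lists are not. You can index tuples.",
    "b.txt": "You can nest tuples inside lists!",
}
hits = sentences_for_top_words(docs, n=2)
for word, found in hits.items():
    print(word)
    for name, sentence in found:
        print(f"  [{name}] {sentence}")
```

For a single document, pass a dictionary with one entry. Matching against a set of whole words per sentence, rather than a substring test, avoids false hits such as "are" matching inside "share".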
【Comments】: