NLTK provides the perfect tool for indexing all the tags that are used with each word:
import nltk
wordtags = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
Or, if you want to lowercase the words as you go:
wordtags = nltk.ConditionalFreqDist((w.lower(), t) for w, t in nltk.corpus.brown.tagged_words())
We now have an index of the tags that belong to each word (plus their frequencies, which the OP doesn't care about):
>>> print(wordtags["clean"].items())
dict_items([('JJ', 48), ('NN-TL', 1), ('RB', 1), ('VB-HL', 1), ('VB', 18)])
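If you only need the tags themselves, iterating over a FreqDist yields its keys, so (going by the output above) something like this works:
>>> sorted(wordtags["clean"])
['JJ', 'NN-TL', 'RB', 'VB', 'VB-HL']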
要查找标签最多的单词,请使用常规 Python 排序:
>>> wtlist = sorted(wordtags.items(), key=lambda x: len(x[1]), reverse=True)
>>> for word, freqs in wtlist[:10]:
...     print(word, "\t", len(freqs), list(freqs))
that 15 ['DT', 'WPS-TL', 'CS-NC', 'DT-NC', 'WPS-NC', 'WPS', 'NIL', 'CS-HL', 'WPS-HL',
'WPO-NC', 'DT-TL', 'DT-HL', 'CS', 'QL', 'WPO']
a 13 ['NN-TL', 'AT-NC', 'NP', 'AT', 'AT-TL-HL', 'NP-HL', 'NIL', 'AT-TL', 'NN',
'NP-TL', 'AT-HL', 'FW-IN-TL', 'FW-IN']
(etc.)
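As a side note, each value in wordtags is an nltk.FreqDist, so if you also want the tags of one word ranked by frequency, most_common() does that (a sketch; output omitted since the exact counts depend on your corpus version):
>>> wordtags["that"].most_common(3)  # the three most frequent tags for "that", with counts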