NLTK Brown Corpus Tags: Answers

【Question title】: NLTK Brown Corpus Tags
【Posted】: 2014-12-02 19:42:07
【Question】:

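When I print nltk.corpus.brown.tagged_words(), it prints about 1161192 tuples, each holding a word and its associated tag.

I want to pick out the distinct words that carry several distinct tags. A word can have more than one tag.

I tried every piece of code from the thread "Append list items by number of hyphens available", but I never get a word with more than 3 tags. As far as I know, some words have as many as 8 or 9 tags.

Where is my approach going wrong, and how can I fix it? I have two separate questions:

How do I count the distinct words in the corpus that have a given number of distinct tags? For example, the number of distinct words in the corpus that have 8 different tags.
Also, I want to know which word has the greatest number of distinct tags.

I am only interested in words, so I am removing punctuation.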

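【Question comments】:

Tags: python nlp nltk corpus

【Solution 1】:

Use a defaultdict(Counter) to keep track of each word and its POS tags. Then sort the words by the len() of their Counter, that is, by the number of distinct tags: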

from collections import defaultdict, Counter
from nltk.corpus import brown

# Store words and their POS tags in a dictionary
# where the key is a word and the value is
# a Counter mapping each POS tag to its count.
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
    word_tags[word][pos] += 1

# To access the POS counter.    
print 'Red', word_tags['Red']
print 'Marlowe', word_tags['Marlowe']
print

# Word with the greatest number of distinct tags.
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]

print word_with_most_distinct_pos
print word_tags[word_with_most_distinct_pos]
print len(word_tags[word_with_most_distinct_pos])

[out]:

Red Counter({u'JJ-TL': 49, u'NP': 21, u'JJ': 3, u'NN-TL': 1, u'JJ-TL-HL': 1})
Marlowe Counter({u'NP': 4})

that
Counter({u'CS': 6419, u'DT': 1975, u'WPS': 1638, u'WPO': 135, u'QL': 54, u'DT-NC': 6, u'WPS-NC': 3, u'CS-NC': 2, u'WPS-HL': 2, u'NIL': 1, u'CS-HL': 1, u'WPO-NC': 1})
12
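
The snippet above uses Python 2 print statements. As a minimal sketch of the same defaultdict(Counter) idea in Python 3, the following runs on a small hand-made list of (word, tag) pairs rather than on the Brown corpus. The sample data and the name `sample_tagged_words` are assumptions made for illustration; in practice `brown.tagged_words()` would take their place.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for brown.tagged_words(): a few (word, tag) pairs.
sample_tagged_words = [
    ("that", "CS"), ("that", "DT"), ("that", "CS"), ("that", "WPS"),
    ("Red", "JJ-TL"), ("Red", "NP"), ("Marlowe", "NP"),
]

# Map each word to a Counter of the tags it was seen with.
word_tags = defaultdict(Counter)
for word, pos in sample_tagged_words:
    word_tags[word][pos] += 1

# max() with a key avoids sorting the whole vocabulary just to take one item.
most_ambiguous = max(word_tags, key=lambda w: len(word_tags[w]))

print(most_ambiguous, len(word_tags[most_ambiguous]))
print(word_tags[most_ambiguous].most_common())
```

Using max() instead of sorted(...)[0] picks the same word (the first one with the largest tag count) without building a sorted list of every key.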

To get the words that have X distinct POS tags:

# Words with 8 distinct POS
word_with_eight_pos = filter(lambda x: len(word_tags[x]) == 8, word_tags.keys())

for i in word_with_eight_pos:
    print i, word_tags[i]
print 

# Words with 9 distinct POS
word_with_nine_pos = filter(lambda x: len(word_tags[x]) == 9, word_tags.keys())

for i in word_with_nine_pos:
    print i, word_tags[i]

[out]:

a Counter({u'AT': 21824, u'AT-HL': 40, u'AT-NC': 7, u'FW-IN': 4, u'NIL': 3, u'FW-IN-TL': 1, u'AT-TL': 1, u'NN': 1})

: Counter({u':': 1558, u':-HL': 138, u'.': 46, u':-TL': 22, u'IN': 20, u'.-HL': 8, u'NIL': 1, u',': 1, u'NP': 1})
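
The question also asked how many distinct words have a given number of tags, with punctuation left out (note that ":" shows up in the output above). One possible sketch, again on hand-made sample data: build the word-to-tags mapping while skipping tokens that contain no letter or digit, then count the words by how many tags they have. The sample list and the `is_word` helper are assumptions for illustration, and what counts as punctuation is a judgment call.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for brown.tagged_words().
sample_tagged_words = [
    ("a", "AT"), ("a", "NN"), ("a", "FW-IN"),
    (":", ":"), (":", "."), (":", "IN"),
    ("run", "VB"), ("run", "NN"),
    ("walk", "VB"), ("walk", "NN"),
    ("Marlowe", "NP"), (",", ","),
]

def is_word(token):
    """Treat a token as a word if it has at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Collect the distinct tags of every non-punctuation token.
word_tags = defaultdict(set)
for word, pos in sample_tagged_words:
    if is_word(word):
        word_tags[word].add(pos)

# Number of distinct tags -> number of distinct words having that many tags.
tag_count_histogram = Counter(len(tags) for tags in word_tags.values())

# Words with exactly n distinct tags.
n = 2
words_with_n_tags = sorted(w for w, tags in word_tags.items() if len(tags) == n)

print(dict(tag_count_histogram))
print(words_with_n_tags)
```

On the real corpus, `tag_count_histogram[8]` would then be the number of distinct words with exactly 8 tags.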

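【Comments】:

Awesome! Thanks a lot, alvas! May I ask one more question? So "that" is the word with the most tags, 12 in total. How can we print, for each of those tags, a sentence from the corpus that contains the word "that"?
Do you mean [i for i in brown.sents() if 'that' in i]?
No, no... The word "that" has 12 tags. I am trying to print sentences from the corpus that contain the word, one for each possible tag. For example, take one sentence from the corpus in which "that" is tagged WPS, then a second example sentence in which it is tagged WPO, and so on, printing 12 sentences in all.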
word_with_12_pos = filter(lambda x: len(word_tags[x]) == 12, word_tags.keys())
Then: [[s for s in brown.tagged_sents() if (w, t) in s] for w in word_with_12_pos for t in word_tags[w]]
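
What the asker describes, one example sentence per tag of a given word, can also be sketched as a single pass over the tagged sentences that records the first sentence seen for each tag. The sentences below are hand-made stand-ins for `brown.tagged_sents()`, and the function name `first_sentence_per_tag` is hypothetical.

```python
def first_sentence_per_tag(tagged_sents, target):
    """Return {tag: first sentence in which `target` carries that tag}."""
    examples = {}
    for sent in tagged_sents:
        for word, tag in sent:
            if word == target and tag not in examples:
                examples[tag] = sent
    return examples

# Hypothetical stand-in for brown.tagged_sents().
sample_tagged_sents = [
    [("He", "PPS"), ("said", "VBD"), ("that", "CS"), ("it", "PPS"), ("rained", "VBD")],
    [("I", "PPSS"), ("like", "VB"), ("that", "DT"), ("book", "NN")],
    [("She", "PPS"), ("knew", "VBD"), ("that", "CS"), ("too", "RB")],
]

examples = first_sentence_per_tag(sample_tagged_sents, "that")
for tag, sent in examples.items():
    print(tag, " ".join(word for word, _ in sent))
```

A single pass reads the corpus once, whereas a nested comprehension over tags re-reads every sentence for each tag and keeps all matching sentences rather than one.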

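【Solution 2】:

You can use itertools.groupby to get what you want. Note that the following code was thrown together quickly and is most likely not the most efficient way to reach your goal (I'll leave optimizing it to you), but it does get the job done...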

import itertools
import operator

import nltk

for k, g in itertools.groupby(sorted(nltk.corpus.brown.tagged_words()), key=operator.itemgetter(0)):
    print k, set(map(operator.itemgetter(1), g))

Output:

...
yonder set([u'RB'])
yongst set([u'JJT'])
yore set([u'NN', u'PP$'])
yori set([u'FW-NNS'])
you set([u'PPSS-NC', u'PPO', u'PPSS', u'PPO-NC', u'PPO-HL', u'PPSS-HL'])
you'd set([u'PPSS+HVD', u'PPSS+MD'])
you'll set([u'PPSS+MD'])
you're set([u'PPSS+BER'])
...
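
itertools.groupby only merges adjacent items that share a key, which is why the tagged words are sorted before grouping. A small Python 3 sketch of the same approach, run on hand-made data (the sample list is an assumption standing in for the corpus):

```python
import itertools
import operator

# Hypothetical stand-in for nltk.corpus.brown.tagged_words().
sample_tagged_words = [
    ("you", "PPSS"), ("yore", "NN"), ("you", "PPO"),
    ("yore", "PP$"), ("you", "PPSS"), ("yonder", "RB"),
]

# Sorting puts identical words next to each other, so groupby
# yields exactly one group per distinct word.
word_to_tags = {
    word: {tag for _, tag in group}
    for word, group in itertools.groupby(
        sorted(sample_tagged_words), key=operator.itemgetter(0)
    )
}

for word, tags in word_to_tags.items():
    print(word, sorted(tags))
```

The sort is the expensive part on a corpus of over a million tokens, which is one reason a dictionary-based approach tends to be cheaper.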

【Comments】:

【Solution 3】:

A two-line way to find the word with the most distinct tags (along with its tags):

import nltk

word2tags = nltk.Index(set(nltk.corpus.brown.tagged_words()))
print(max(word2tags.items(), key=lambda wt: len(wt[1])))
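
For readers without NLTK at hand: in this usage nltk.Index behaves like a defaultdict(list) filled from (key, value) pairs, and wrapping the input in set() removes duplicate (word, tag) pairs first, so each list ends up holding distinct tags. A plain-Python sketch of that idea on hand-made data (the sample list is an assumption):

```python
from collections import defaultdict

# Hypothetical stand-in for nltk.corpus.brown.tagged_words().
sample_tagged_words = [
    ("that", "CS"), ("that", "DT"), ("that", "CS"),
    ("clean", "JJ"), ("clean", "VB"), ("clean", "JJ"), ("clean", "RB"),
]

# set() drops repeated (word, tag) pairs, so every tag is appended once per word.
word2tags = defaultdict(list)
for word, tag in set(sample_tagged_words):
    word2tags[word].append(tag)

# The entry whose tag list is longest is the most ambiguous word.
word, tags = max(word2tags.items(), key=lambda wt: len(wt[1]))
print(word, sorted(tags))
```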

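【Comments】:

【Solution 4】:

NLTK provides the perfect tool for indexing all the tags used with each word:

import nltk

wordtags = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())

Or, if you want to case-fold the words as you go:

wordtags = nltk.ConditionalFreqDist((w.lower(), t) for w, t in nltk.corpus.brown.tagged_words())

We now have an index of the tags belonging to each word (plus their frequencies, which the OP did not care about):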

>>> print(wordtags["clean"].items())
dict_items([('JJ', 48), ('NN-TL', 1), ('RB', 1), ('VB-HL', 1), ('VB', 18)])

To find the words with the most tags, fall back on ordinary Python sorting:

>>> wtlist = sorted(wordtags.items(), key=lambda x: len(x[1]), reverse=True)
>>> for word, freqs in wtlist[:10]:
        print(word, "\t", len(freqs), list(freqs))

that     15 ['DT', 'WPS-TL', 'CS-NC', 'DT-NC', 'WPS-NC', 'WPS', 'NIL', 'CS-HL', 'WPS-HL', 
             'WPO-NC', 'DT-TL', 'DT-HL', 'CS', 'QL', 'WPO']
a        13 ['NN-TL', 'AT-NC', 'NP', 'AT', 'AT-TL-HL', 'NP-HL', 'NIL', 'AT-TL', 'NN', 
             'NP-TL', 'AT-HL', 'FW-IN-TL', 'FW-IN']
(etc.)
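
This listing reports 15 tags for "that", against 12 in Solution 1. A likely explanation is the lower-casing: once "That" and "that" are folded together, their tag sets are merged. The sketch below illustrates the effect on hand-made data, using a defaultdict(Counter) as a stand-in for nltk.ConditionalFreqDist; the sample pairs and the helper name `index_tags` are assumptions, not corpus counts or NLTK API.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) pairs; not taken from the Brown corpus.
sample_tagged_words = [
    ("that", "CS"), ("that", "DT"), ("That", "DT"),
    ("That", "DT-HL"), ("That", "WPS-TL"),
]

def index_tags(pairs, fold_case=False):
    """Build {word: Counter of tags}, optionally lower-casing each word."""
    index = defaultdict(Counter)
    for word, tag in pairs:
        index[word.lower() if fold_case else word][tag] += 1
    return index

case_sensitive = index_tags(sample_tagged_words)
case_folded = index_tags(sample_tagged_words, fold_case=True)

# Distinct tags for "that" without and with case folding.
print(len(case_sensitive["that"]), len(case_folded["that"]))
```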

【Comments】: