How to get the most frequent word from a corpus? Answers

【Question title】: How to get the most frequent word from a corpus?
【Posted】: 2017-03-03 08:35:29
【Question】:

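I'm working with a corpus and want to get the most and least frequent words and word classes from it. I have the beginnings of some code, but I'm running into errors I don't know how to handle. I want to get the most frequent words from the Brown corpus, and then the most and least frequent word classes. This is my code: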

import re
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from collections import defaultdict, Counter
from nltk.corpus import brown

brown = nltk.corpus.brown
stoplist = stopwords.words('english')

from collections import defaultdict

def toptenwords(brown):
    words = brown.words()
    no_capitals = ([word.lower() for word in words])
    filtered = [word for word in no_capitals if word not in stoplist]
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    wordcounter = defaultdict(int)
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)]
    return sorting

print(toptenwords(brown))

words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print Counter(words_2).most_common(10)

words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print Counter(words_2).most_common(10)


# Keeps words and pos into a dictionary
# where the key is a word and
# the value is a counter of POS and counts
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
    word_tags[word][pos] +=1

# To access the POS counter.
print 'Red', word_tags['Red']
print 'Marlowe', word_tags['Marlowe']
print

# Greatest number of distinct tag.
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]

print word_with_most_distinct_pos
print word_tags[word_with_most_distinct_pos]
print len(word_tags[word_with_most_distinct_pos])

# which word has the greatest number of distinct tags
word_tags_2 = nltk.defaultdict(lambda: set())
for word, token in tagged_words:
    word_tags[word].add(token)
    ambig_words = sorted([(k, len(v)) for (k, v) in word_tags.items()]),
    key=itemgetter(1), reverse=True)[:50]
  print [(word, numtoks, word_tags[word]) for (word, numtoks) in ambig_words]

When I run the code above, I get the following error:

File "Oblig2a.py", line 64
    key=itemgetter(1), reverse=True)[:50]
                               ^
SyntaxError: invalid syntax

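What I want to get from this code:

1. The most frequent word
2. The most frequent word class
3. The least frequent word class
4. How many words have more than one word class
5. Which word has the most tags, and how many distinct tags it has
6. The last thing I need help with is writing a function that takes a specific word and reports how many times it occurs with each tag. I am trying to do that above, but I can't get it to work...

It's numbers 3, 4, 5 and 6 I need help with... Any help would be welcome.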

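【Comments】:

Look at the stack trace. The offending line is clearly stoplist = stopwords.words(brown). This method expects file IDs, not a sequence of tagged words (which is what you assigned to the variable brown).
How do I change that?
You should pass the function the name of a language, e.g. stoplist = stopwords.words('english')
Now it runs fine, but I'm not sure how to print what I want from the output... I've tried several places and approaches, but nothing gets printed...
Vebjørn, look at the line where you define no_capitals, think about what it does, and how that affects your goal of counting words.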

Tags: python python-2.7 nltk counter corpus

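【Solution 1】:

There are 3 problems with the code:

1. The error the interpreter reports: you should pass a language name to the stopwords function: stoplist = stopwords.words('english')
2. Sort the dictionary properly, using the get method of the defaultdict: [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
3. Use a translate table for Unicode data; see "string.translate() with unicode data in python"

Also note that the Brown tagged words are tuples of the form (word, part-of-speech).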
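
The sorting and translate-table points above can be tried on a toy token list without downloading any corpus. This is only a minimal sketch, assuming Python 3 (where every str is Unicode, so a table keyed by ord() works directly); the `tokens` list is made up for illustration and stands in for brown.words():

```python
import string
from collections import Counter

# Hypothetical toy tokens standing in for brown.words()
tokens = ["The", "cat", ",", "the", "dog", ".", "Cat", "cat's"]

# Map every punctuation code point to None so translate() deletes it
translate_table = {ord(char): None for char in string.punctuation}

cleaned = [tok.lower().translate(translate_table) for tok in tokens]
cleaned = [tok for tok in cleaned if tok]  # drop tokens that were pure punctuation

wordcounter = Counter(cleaned)

# Sort the keys by their count, highest first
by_count = sorted(wordcounter, key=wordcounter.get, reverse=True)
ranked = [(k, wordcounter[k]) for k in by_count]
print(ranked)
```

Counter.most_common() produces the same kind of ranking in a single call, so the explicit sorted(...) is mainly useful for seeing what happens underneath.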

Full code:

import string
from collections import Counter, defaultdict

import nltk
from nltk.corpus import stopwords

brown = nltk.corpus.brown
# A set makes the stop-word membership test fast
stoplist = set(stopwords.words('english'))

def toptenwords(brown):
    words = brown.words()
    # Lowercase every token; keep a list (not a set) so repeats are still counted
    no_capitals = [word.lower() for word in words]
    filtered = [word for word in no_capitals if word not in stoplist]
    # Map each punctuation character to None so translate() deletes it
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    wordcounter = defaultdict(int)
    for word in no_punct:
        if word:  # skip tokens that were pure punctuation
            wordcounter[word] += 1
    sorting = [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
    # Keep only the ten most frequent entries, as the function name promises
    return sorting[:10]


print(toptenwords(brown))

words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print(Counter(words_2).most_common(10))

words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print(Counter(words_2).most_common(10))
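
The remaining items from the question (least frequent word class, words with more than one class, the word with the most distinct tags, and per-tag counts for one word) can all be derived from the same (word, tag) tuples. The SyntaxError in the question appears to come from the `)` that closes `sorted(` before `key=`; keeping `key=` inside the call avoids it. What follows is only a minimal sketch, assuming Python 3; the `tagged` list is a hypothetical stand-in for brown.tagged_words(), and the helper name `tag_counts` is made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for brown.tagged_words(): (word, tag) tuples
tagged = [
    ("the", "AT"), ("run", "VB"), ("run", "NN"), ("the", "AT"),
    ("red", "JJ"), ("red", "NN"), ("red", "NP"), ("dog", "NN"),
]

# Tag frequencies; most_common() orders from most to least frequent
tag_freq = Counter(tag for _, tag in tagged)
most_common_tag = tag_freq.most_common(1)[0]
least_common_tag = tag_freq.most_common()[-1]  # one of the rarest tags

# word -> Counter of the tags it occurs with
word_tags = defaultdict(Counter)
for word, tag in tagged:
    word_tags[word][tag] += 1

# Words that occur with more than one tag
ambiguous = [w for w, tags in word_tags.items() if len(tags) > 1]

# Words ranked by number of distinct tags; key= sits inside sorted(...)
ranked = sorted(word_tags, key=lambda w: len(word_tags[w]), reverse=True)[:50]
most_ambiguous = ranked[0]

def tag_counts(word):
    """Return (tag, count) pairs for `word`, most frequent tag first."""
    # .get() avoids inserting an empty Counter for unseen words
    return word_tags.get(word, Counter()).most_common()

print(most_common_tag, least_common_tag)
print(len(ambiguous), most_ambiguous, len(word_tags[most_ambiguous]))
print(tag_counts("red"))
```

Replacing `tagged` with brown.tagged_words() (optionally with categories="news") should apply the same steps to the real corpus, at the cost of a longer run.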

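【Comments】:

Thanks! But how do I get the least used words and word classes from this code?
See this topic: stackoverflow.com/questions/4743035/…
@VebjørnBergaplass, to use nltk you need to be able to program a little. You need to narrow "I'm not getting the output I want" down to a programming question.
Sorry. When I run the edited code (with print), I get no output. I tried running it with the print at the end, but nothing...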