【Question Title】: Dropping specific words out of an NLTK distribution beyond stopwords
【Posted】: 2015-08-05 08:14:37
【Question Description】:

I have a simple sentence like the one below. I want to drop prepositions and words such as A and IT from the list. I looked through the Natural Language Toolkit (NLTK) documentation but couldn't find anything. Can someone show me how to do this? Here is my code:

import nltk
from nltk.tokenize import RegexpTokenizer
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
test = test.upper()
tokenizer = RegexpTokenizer(r'\w+')  # keep only word characters, dropping punctuation
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)  # frequency distribution over the tokens
common = fdist.most_common(100)

【Question Comments】:

Tags: python list nltk


【Solution 1】:

Could stopwords be the solution you're looking for?

You can filter them out of the tokenized text quite easily:

from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

en_stopws = stopwords.words('english')  # this loads the default stopwords list for English
en_stopws.append('spam')  # add any words you don't like to the list

test = "Hello, this is my sentence. It is a very basic sentence with not much information in it but a lot of spam"
test = test.lower()  # I changed it to lower(), since stopwords are all lower case
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
tokens = [token for token in tokens if token not in en_stopws]  # filter stopwords
fdist = FreqDist(tokens)
common = fdist.most_common(100)

I haven't found a good way to delete entries from a FreqDist once it's built; if you discover one, please let me know.
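
One possibility, since FreqDist subclasses collections.Counter (as Solution 2 below points out): ordinary dict-style deletion should remove an entry and its count outright. A minimal sketch, assuming the fdist from the question's code; the keys to drop are made up for illustration:

# sketch: delete unwanted keys in place; FreqDist supports dict/Counter deletion
for word in ('A', 'IT'):     # illustrative keys only
    if word in fdist:        # guard against KeyError for absent keys
        del fdist[word]      # drops the entry and its count entirely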

【Comments】:

  • I'm getting a traceback error... File "C:\Python27\lib\site-packages\nltk\data.py", line 293, in __init__ raise IOError('No such file or directory: %r' % _path) IOError: No such file or directory: u'C:\\Users\\jason\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\IS' — that happened using the word IS, but I see your approach of filtering the words out before they go in
  • @jason_cant_code I think you've misunderstood how the stopwords corpus is loaded. I've edited my answer to try to make it a bit clearer. Also have a look at the book for more information
【Solution 2】:

Essentially, nltk.probability.FreqDist is a collections.Counter object (https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L61). Given a dictionary object, there are several ways to filter it:

1. Read the text into a FreqDist, then filter it with a lambda function

>>> import nltk
>>> text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
>>> tokenized_text = nltk.word_tokenize(text)
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> word_freq = nltk.FreqDist(tokenized_text)
>>> dict_filter = lambda word_freq, stopwords: dict( (word,word_freq[word]) for word in word_freq if word not in stopwords )
>>> filtered_word_freq = dict_filter(word_freq, stopwords)
>>> len(word_freq)
17
>>> len(filtered_word_freq)
8
>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}

2. Read the text into a FreqDist, then filter it with a dictionary comprehension

>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq = dict((word, freq) for word, freq in word_freq.items() if word not in stopwords)
>>> filtered_word_freq 
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}
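
Note that options 1 and 2 both return a plain dict, so FreqDist conveniences such as most_common() are lost. A small variation (a sketch, not part of the original answer): since FreqDist is a Counter subclass, its constructor accepts a mapping of counts, so the same comprehension can be wrapped back into a FreqDist:

>>> filtered_word_freq = nltk.FreqDist({word: freq for word, freq in word_freq.items() if word not in stopwords})
>>> filtered_word_freq.most_common(1)
[('sentence', 2)]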

3. Filter the words before reading them into the FreqDist

>>> import nltk
>>> text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
>>> tokenized_text = nltk.word_tokenize(text)
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> filtered_tokenized_text = [word for word in tokenized_text if word not in stopwords]
>>> filtered_word_freq = nltk.FreqDist(filtered_tokenized_text)
>>> filtered_word_freq
FreqDist({'sentence': 2, 'information': 1, ',': 1, 'It': 1, '.': 1, 'much': 1, 'basic': 1, 'Hello': 1})
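
One caveat visible in the outputs above: 'It' survives every filter even though 'it' is a stopword, because the NLTK stopword list is all lowercase and the membership test is case-sensitive. Lowercasing each token before the test (a sketch along the lines of Solution 1's lower() call; repr ordering of the ties may differ) catches it:

>>> filtered_tokenized_text = [word.lower() for word in tokenized_text if word.lower() not in stopwords]
>>> nltk.FreqDist(filtered_tokenized_text)
FreqDist({'sentence': 2, 'hello': 1, ',': 1, '.': 1, 'basic': 1, 'much': 1, 'information': 1})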

【Comments】:

  • In option 1, I don't quite see how the second filtering reduces the word count. I think both filters are removing stopwords from the dictionary of unique words, is that right?