Removing stopwords using NLTK in Python

【Question title】: Removing stopwords using NLTK in Python
【Posted】: 2016-11-11 11:27:03
【Question description】:

I am using NLTK to remove stopwords from a list element. Here is my code snippet:

dict1 = {}
for ctr,row in enumerate(cur.fetchall()):
    list1 = [row[0],row[1],row[2],row[3],row[4]]
    dict1[row[0]] = list1
    print ctr+1,"\n",dict1[row[0]][2]
    list2 = [w for w in dict1[row[0]][3] if not w in stopwords.words('english')]
    print list2

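The problem is that this not only removes stopwords but also strips characters out of other words. For example, 'i' and other single-letter stopwords are removed from the word 'Orientation', and list2 ends up holding characters instead of words, i.e. ['O','r','e','n','n','','f','','3','','r','e','r','e','','p','n','\n','\n','\n','O','r','e','n','n','','f','','n','','r','e','r','e','','r','p','l', ... whereas I want it stored as ['Orientation', ...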

【Question comments】:

Try tokenizing your text first.
What is cur in your code? Can you post more of the surrounding code?

Tags: python nltk stop-words

【Solution 1】:

First, the way you build list1 looks a bit odd to me. I think there is a more Pythonic way:

list1 = row[:5]

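Next, is there a reason you reach row[3] through dict1[row[0]][3] instead of using row[3] directly?

Finally, assuming row is a list of strings, building list2 from row[3] iterates over each character, not each word. That is probably why 'i' and 'a' (and a few other characters) are being filtered out.

The correct list comprehension is: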
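To make the diagnosis concrete, here is a minimal sketch of what happens when a string is filtered character by character. The set `SINGLE_LETTER_STOPWORDS` is a hand-picked, hypothetical stand-in for the single-letter entries assumed to be in `stopwords.words('english')`; it is not taken from the question.

```python
# Hand-picked single-letter entries; an assumed subset of stopwords.words('english').
SINGLE_LETTER_STOPWORDS = {"i", "a", "t", "o", "s"}

text = "Orientation"

# Iterating over a str yields one character at a time,
# so the stopword test is applied to characters, not words.
chars_kept = [w for w in text if w not in SINGLE_LETTER_STOPWORDS]

# Splitting first yields whole words, which is what the filter was meant to see.
words = text.split()

print(chars_kept)  # ['O', 'r', 'e', 'n', 'n']
print(words)       # ['Orientation']
```

The capital 'O' survives only because the comparison is case-sensitive, which lines up with the 'O','r','e','n','n' fragment shown in the question.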

list2 = [w for w in row[3].split(' ') if w not in stopwords]

You have to split the string apart somehow, most likely on whitespace. That takes you from:

'Hello, this is row3'

to

['Hello,', 'this', 'is', 'row3']

so that iteration yields whole words rather than individual characters.

【Comments】:

TypeError: argument of type 'LazyCorpusLoader' is not iterable
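A small sketch of the split-then-filter approach follows. It assumes the name used on the right-hand side of `not in` is a plain collection of words; here `stop_words` is a hypothetical hard-coded set standing in for `stopwords.words('english')`, and `row3` stands in for `row[3]`. If the name referred to the `nltk.corpus.stopwords` object itself rather than a word list, the membership test would likely fail, which appears to be what the TypeError reported in the comments reflects.

```python
# Hypothetical stand-in for stopwords.words('english'); a set keeps `in` checks fast.
stop_words = {"this", "is", "a", "the", "and"}

row3 = "Hello, this is row3"  # stands in for row[3]

# split() with no argument splits on any run of whitespace, including newlines.
# Lowercasing each word lets capitalised stopwords match the lowercase list.
list2 = [w for w in row3.split() if w.lower() not in stop_words]

print(list2)  # ['Hello,', 'row3']
```

Note that plain splitting leaves punctuation attached ('Hello,'), so a real tokenizer may be preferable when that matters.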

【Solution 2】:

First, make sure list1 is a list of words, not an array of characters. Here is a code snippet you may be able to adapt.

from nltk import word_tokenize
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')    # get english stop words

# test document
document = '''A moody child and wildly wise
Pursued the game with joyful eyes
'''

# first tokenize your document to a list of words
words = word_tokenize(document)
print(words)

# then remove all stop words
content = [w for w in words if w.lower() not in english_stopwords]
print(content)

The output will be:

['A', 'moody', 'child', 'and', 'wildly', 'wise', 'Pursued', 'the', 'game', 'with', 'joyful', 'eyes']
['moody', 'child', 'wildly', 'wise', 'Pursued', 'game', 'joyful', 'eyes']
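As a rough sketch of how the same idea could be applied to the row loop from the question, the example below builds the stopword set once and filters `row[3]` word by word. Everything here is an assumption made for illustration: the in-memory SQLite table stands in for whatever `cur` actually queries, `STOP_WORDS` is a hypothetical hard-coded substitute for `set(stopwords.words('english'))`, and the regular expression is a crude stand-in for `word_tokenize` so that the example runs without NLTK data.

```python
import re
import sqlite3

# Hypothetical stand-in for set(stopwords.words('english')); built once, outside the loop.
STOP_WORDS = {"a", "and", "the", "with", "of", "is", "this"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Split text into word tokens and drop those found in stop_words."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)  # crude tokenizer, not word_tokenize
    return [t for t in tokens if t.lower() not in stop_words]


# In-memory table standing in for the query behind `cur` in the question.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE docs (id INTEGER, a TEXT, b TEXT, body TEXT, c TEXT)")
cur.execute("INSERT INTO docs VALUES (1, 'x', 'y', 'Orientation of the new staff', 'z')")
cur.execute("SELECT * FROM docs")

dict1 = {}
filtered = {}
for row in cur.fetchall():
    dict1[row[0]] = list(row[:5])                # first five columns, keyed by id
    filtered[row[0]] = remove_stopwords(row[3])  # words of row[3] minus stopwords

print(filtered[1])  # ['Orientation', 'new', 'staff']
conn.close()
```

Building the set once avoids calling `stopwords.words('english')` for every word, as the comprehension in the question does.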

【Comments】: