Python NLTK text processing: removing stopwords quickly [duplicate]

【Question title】: python nltk processing with text, remove stopwords quickly [duplicate]
【Posted】: 2019-03-14 13:15:54
【Question description】:

I am using nltk to process text data. When I want to remove stopwords, I usually use this code.

text_clean = [w for w in text if w.lower() not in stopwords]

But this code always takes too long. (Maybe my data is too big...)
Is there any way to reduce the time? Thanks.

【Question discussion】:

Is stopwords a set or a list?
Sorry, it is a list.

Tags: python text nltk

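【Solution 1】:

Try converting stopwords to a set. With a list, your approach is O(n*m), where n is the number of words in the text and m is the number of stopwords; with a set it is O(n + m). Let's compare the two approaches, list and set: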

import timeit
from nltk.corpus import stopwords


def list_clean(text):
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]


def set_clean(text):
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))

Output

7.6629380420199595
0.8327891009976156

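In the code above, list_clean is a function that removes stopwords using a list, and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost 10 times faster.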
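If the cleaning function is called many times, it may also help to build the set once and reuse it rather than rebuilding it on every call. The following is only a minimal sketch of that idea: make_stopword_filter and demo_stop_words are hypothetical names introduced here, and the short hand-written word list stands in for stopwords.words('english') so the snippet runs without the NLTK corpus.

```python
def make_stopword_filter(stop_words):
    """Return a cleaning function that reuses one prebuilt set of stopwords."""
    # Build the set a single time; lowercase it so lookups match w.lower().
    stop_set = frozenset(s.lower() for s in stop_words)

    def clean(tokens):
        # Each membership test against the set is constant time on average.
        return [w for w in tokens if w.lower() not in stop_set]

    return clean


# Illustrative list only (an assumption); in practice this would be
# stopwords.words('english') from nltk.corpus.
demo_stop_words = ['the', 'is', 'on', 'that', 'in', 'some']
clean = make_stopword_filter(demo_stop_words)
print(clean(['The', 'cat', 'is', 'on', 'the', 'table']))  # ['cat', 'table']
```

The returned clean function can then be applied to each document in turn without paying the set construction cost again.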

Update

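O(n*m) and O(n + m) are examples of big O notation, a theoretical way of measuring the efficiency of an algorithm. Roughly speaking, the faster the expression grows with the input size, the less efficient the algorithm. Here O(n*m) grows faster than O(n + m), so the list_clean approach is theoretically less efficient than the set_clean approach. The figures come from the fact that a membership test on a list is a linear scan, O(m) for a list of m stopwords, whereas a membership test on a set takes constant time on average, usually written O(1). Repeating the test for each of the n words gives O(n*m) for the list; for the set, it costs O(m) to build the set plus O(n) for the lookups.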
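To make the two growth rates concrete, here is a rough counting model rather than a benchmark. The function names are hypothetical, and the set version simply assumes one step per inserted stopword and one step per lookup, which is the average-case behaviour of a hash set, not a guarantee.

```python
def list_scan_steps(tokens, stop_list):
    """Count the comparisons a linear scan makes for `w in stop_list`."""
    steps = 0
    for w in tokens:
        for s in stop_list:
            steps += 1
            if s == w:
                break  # a list scan stops at the first match
    return steps


def set_lookup_steps(tokens, stop_list):
    """Model (assumed): m steps to build the set, then one step per token."""
    return len(stop_list) + len(tokens)


# Worst case for the list: no token is a stopword, so every scan runs to the end.
tokens = ['cat'] * 1000                        # n = 1000
stop_list = [f'w{i}' for i in range(100)]      # m = 100
print(list_scan_steps(tokens, stop_list))      # 100000, i.e. n * m
print(set_lookup_steps(tokens, stop_list))     # 1100, i.e. n + m
```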

【Discussion】:

Thanks, but I don't understand what O(n*m) and O(n + m) mean. Could you explain what they are?
I updated the answer.