Faster way to remove stop words in Python: answers

【Question title】: Faster way to remove stop words in Python
【Posted】: 2013-11-02 20:09:08
【Question】:

I am trying to remove stop words from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

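I am processing 6 million such strings, so speed is important. Profiling my code, the slowest part is the lines above. Is there a better way to do this? I was thinking of using something like a regex with re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I would also be happy to hear about other, possibly faster, methods.

Note: I tried wrapping stopwords.words('english') in set(), as someone suggested, but it made no difference.

Thank you.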

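【Comments on the question】:

How big is stopwords.words('english')?
@SteveBarnes A list of 127 words
Did you wrap it inside the list comprehension or outside? Try adding stw_set = set(stopwords.words('english')) and use this object instead
@alko I thought I had wrapped it outside with no effect, but I just tried again and my code now runs at least 10 times faster!!!
Are you processing the text line by line or all at once?

Tags: python regex stop-words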

【Solution 1】:

Try caching the stop words object, as shown below. Constructing it each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

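I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

nCalls  Cumulative Time

10000   7.723   words.py:7(testFuncOld)

10000   0.140   words.py:11(testFuncNew)

So caching the stop words instance gives roughly a 55x speedup (7.723 s versus 0.140 s cumulative).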
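One of the comments below suggests that turning the cached list into a set, built once, gives a further gain. The following is a minimal sketch for measuring that on your own machine, not a reproduction of the profile above. `STOP_WORDS_LIST` is a small hypothetical stand-in for `stopwords.words("english")` so the sketch runs without NLTK, and the helper names are illustrative.

```python
import timeit

# Hypothetical stand-in for stopwords.words("english"), so the sketch runs without NLTK.
STOP_WORDS_LIST = ["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"]
# Built once at import time and reused for every string.
STOP_WORDS_SET = frozenset(STOP_WORDS_LIST)


def remove_stop_words(text, stop_words):
    """Return text without the whitespace-separated tokens found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


def time_lookup(stop_words, repeats=10000):
    """Time `repeats` calls of remove_stop_words against the given container."""
    return timeit.timeit(
        lambda: remove_stop_words("hello bye the the hi", stop_words),
        number=repeats,
    )


if __name__ == "__main__":
    print("list:", time_lookup(STOP_WORDS_LIST))
    print("set: ", time_lookup(STOP_WORDS_SET))
```

With only a handful of stop words the difference may be small; it should grow with the size of the stop list, since a list is scanned linearly while a set is hashed.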

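【Comments】:

Agreed. The performance boost comes from caching the stop words, not really from creating a set.
Certainly you get a dramatic boost from not having to read the list from disk every time, because that is the most time-consuming operation. But if you now turn your "cached" list into a set (only once, of course), you will get another boost.
Can anyone tell me whether this supports Japanese?
It gives me this: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal text=' '.join([word for word in text.split() if word not in stop_words]) Solomon, please provide me with a solution
Does it also work (speed things up) for a pandas DataFrame? df['column']=df['column'].apply(lambda x: [item for item in x if item not in cachedStopWords])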

【Solution 2】:

Use a regular expression to remove every word that matches a stop word:

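import re
from nltk.corpus import stopwords

# Build one alternation of all stop words, compiled once and reused.
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)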

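This will probably be much faster than looping yourself, especially for large input strings.

If the last word in the text is removed this way, you may be left with trailing whitespace. I suggest handling that separately.

【Comments】:

Any idea what the complexity of this would be? If w = the number of words in my text and s = the number of words in the stop list, I think the loop would be on the order of w log s. In this case w is about s, so it's w log w. Wouldn't grep be slower, since it (roughly) has to match character by character?
Actually, I think the complexities in the O(...) sense are the same. Both are O(w log s), yes. But regular expressions are implemented at a much lower level and are heavily optimized. Splitting into words already means copying everything, creating a list of strings and the list itself, all of which takes precious time.
This approach is much faster than splitting lines, tokenizing words, and then checking each word against a stop-word set, especially for larger text inputs.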
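The trailing-whitespace caveat can be handled with a final `strip()`. The sketch below is one possible way to do it, not necessarily what the answer's author had in mind. `STOP_WORDS` is a hypothetical stand-in for `stopwords.words('english')`, and `re.escape` is added as a precaution in case a stop word ever contains a regex metacharacter.

```python
import re

# Hypothetical stand-in for stopwords.words('english').
STOP_WORDS = ["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"]

# One compiled alternation; each word is escaped so it is matched literally.
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b\s*")


def remove_stop_words(text):
    """Delete stop words plus the whitespace after them, then trim the ends."""
    return PATTERN.sub("", text).strip()
```

Note that `\b` treats an apostrophe as a word boundary, so with this pattern a token such as "it's" is cut down to "'s". The pattern is also case-sensitive unless `re.IGNORECASE` is passed.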

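【Solution 3】:

Sorry for the late reply. This may be useful for new users.

Create a dictionary of stop words using the collections library.

Use that dictionary for very fast lookups (time = O(1)) rather than searching a list (time = O(number of stop words)).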

from collections import Counter
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)  # keys are hashed, so "in" is a constant-time lookup
text = ' '.join([word for word in text.split() if word not in stopwords_dict])
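The code above only ever tests membership, so the counts stored by `Counter` go unused. As a small sketch of that point, the following shows a `Counter` and a plain `set` giving the same result for the same filter. `STOP_WORDS` is a hypothetical stand-in for `stopwords.words('english')` and `filter_tokens` is an illustrative helper, not part of the original answer.

```python
from collections import Counter

# Hypothetical stand-in for stopwords.words('english').
STOP_WORDS = ["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"]

as_counter = Counter(STOP_WORDS)  # hash table mapping word -> count (count unused here)
as_set = set(STOP_WORDS)          # hash table of words only


def filter_tokens(text, lookup):
    """Keep the tokens of text that are not keys/members of lookup."""
    return " ".join([word for word in text.split() if word not in lookup])


# Both containers answer "word in lookup" by hashing the word.
same = filter_tokens("hello bye the the hi", as_counter) == filter_tokens(
    "hello bye the the hi", as_set
)
```

Testing `word not in as_counter` does not insert missing words into the `Counter`, so the lookup table stays the same size however many strings are processed.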

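【Comments】:

This does indeed speed things up a lot, even compared with the regex-based approach.
This is a really great answer, I wish it were ranked higher. It's incredible how fast it was at removing text from a list of 20k items. The regular way took more than 1 hour, while using Counter took 20 seconds.
Can you explain how 'Counter' speeds up the process? @Gulshan Jangid
Well, the main reason the code above is fast is that we are searching in a dictionary, which is basically a hashmap. In a hashmap, lookup time is O(1). Besides that, Counter is part of the collections library, and the library is written in C; since C is much faster than Python, Counter is faster than similar code written in Python.
Just tested this; on average it is 3x faster than the regex approach. A simple and creative solution, the best one so far.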

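【Solution 4】:

First, you are creating the stop words for every string. Create them once. A set would indeed be great here.

forbidden_words = set(stopwords.words('english'))

Next, get rid of the [] inside join. Use a generator instead.

Replace

' '.join([x for x in ['a', 'b', 'c']])

with

' '.join(x for x in ['a', 'b', 'c'])

The next thing to deal with would be making .split() yield values instead of returning an array. ~~I believe regex would be a good replacement here.~~ See this thread for why s.split() is actually fast.

Lastly, do such a job in parallel (removing stop words from 6m strings). That is a whole different topic.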
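The answer leaves parallel processing as a separate topic. As one possible starting point, here is a minimal sketch using the standard library's `multiprocessing.Pool`. `STOP_WORDS`, `clean`, `clean_all`, and the default `processes` and `chunksize` values are all illustrative assumptions, not taken from the thread.

```python
from multiprocessing import Pool

# Hypothetical stand-in for set(stopwords.words('english')), built once per process.
STOP_WORDS = frozenset(["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"])


def clean(text):
    """Remove stop words from a single string (runs inside a worker process)."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def clean_all(texts, processes=2, chunksize=1000):
    """Spread the strings over worker processes; results keep the input order."""
    with Pool(processes=processes) as pool:
        # A large chunksize reduces inter-process overhead for many short strings.
        return pool.map(clean, texts, chunksize=chunksize)
```

Call `clean_all(list_of_strings)` from code guarded by `if __name__ == "__main__":`, which `multiprocessing` requires on platforms that spawn rather than fork. Whether this beats a single process depends on how the cost of shipping strings between processes compares with the filtering itself, so it is worth timing on real data.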

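【Comments】:

I doubt that using a regex would be an improvement, see stackoverflow.com/questions/7501609/python-re-split-vs-split/…
Just found it too. :)
Thanks. The set gave at least an 8x speedup. Why does using a generator help? RAM isn't an issue for me, since each piece of text is quite small, about 100-200 words.
Actually, I have seen join perform better with a list comprehension than with the equivalent generator expression.
Set difference also seems to work: clean_text = set(text.lower().split()) - set(stopwords.words('english'))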

【Solution 5】:

Try this, which avoids looping and instead uses a regular expression to remove the stop words:

import re
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)

【Comments】:

【Solution 6】:

Using just a regular dict seems to be the fastest solution so far.
It even beats the Counter solution by about 10%

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
text = " ".join([word for word in text.split() if word not in stopwords_dict])

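Tested using the cProfile profiler.

You can find the test code used here: https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682

Edit:

On top of that, if we replace the list comprehension with a loop, we get another 20% increase in performance.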

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'

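new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "  # keep a separator so the kept words do not run together
text = new.rstrip()  # drop the trailing space left after the last word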
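The 20% figure, and the list-versus-generator disagreement in the comments under Solution 4, are easy to check on your own data. The following is a small timing harness sketched for that purpose. `STOP_WORDS` is a hypothetical stand-in for the NLTK list, the function names are illustrative, and no particular ranking of the variants is implied.

```python
import timeit

# Hypothetical stand-in for the NLTK English stop word list.
STOP_WORDS = frozenset(["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"])


def with_list_comprehension(text):
    return " ".join([w for w in text.split() if w not in STOP_WORDS])


def with_generator(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def with_loop(text):
    kept = []
    for w in text.split():
        if w not in STOP_WORDS:
            kept.append(w)
    return " ".join(kept)


VARIANTS = (with_list_comprehension, with_generator, with_loop)


def benchmark(text, repeats=10000):
    """Return {variant name: seconds for `repeats` calls} for each variant."""
    return {
        func.__name__: timeit.timeit(lambda: func(text), number=repeats)
        for func in VARIANTS
    }
```

Calling `benchmark(some_text)` on a string of realistic length (the question mentions 100-200 words) gives a fairer comparison than the five-word example, and results can differ between Python versions.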

【Comments】: