Python: remove stop words from a pandas DataFrame (answers)

【Question title】: Python remove stop words from pandas dataframe
【Posted】: 2021-08-13 22:08:21
【Question】:

I want to remove the stop words from the "tweet" column. How do I iterate over each row and each item?

import pandas as pd

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]
test["tweet"] = test["tweet"].str.lower().str.split()

from nltk.corpus import stopwords
stop = stopwords.words('english')

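【Comments on the question】:

An old post, but for reference: the stop word list includes words such as "i" and "a". @EdChum, your code would therefore mangle all the words.
@user3120554 Perhaps you could sort the stop words by spaces and length.

Tags: python pandas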

【Solution 1】:

Use a list comprehension:

test['tweet'].apply(lambda x: [item for item in x if item not in stop])

0               [love, car]
1           [view, amazing]
2    [feel, great, morning]
3        [excited, concert]
4            [best, friend]

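【Comments】:

This does not keep the strings intact, so once the stop words are removed you can no longer search for word combinations. Ed Chum's comment above preserves the strings.
I had to add str(x).split(), so it becomes test['tweet'].apply(lambda x: [item for item in str(x).split() if item not in stopwords.words('spanish')]), because I got an error saying that a 'float' object is not iterable.
@Alex Montoya, I found this question and answer. I am trying to apply your suggestion, but I get an empty column: df['tweet'] = df['tweet'].apply(lambda x: [item for item in str(x).split() if item not in stop]). Do you know what might cause this? (I would like to avoid posting a duplicate question.) Many thanks.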
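
The 'float' object is not iterable error mentioned in the comments typically appears when the column contains missing values, since NaN is a float. Below is a minimal sketch of a helper that tolerates such cells. It assumes a small hand-written stop list in place of nltk's so that it runs without any download; remove_stopwords and DEMO_STOP are hypothetical names, not part of the original answers.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
DEMO_STOP = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

def remove_stopwords(value, stop=DEMO_STOP):
    """Return the non-stop-word tokens of value as a list.

    Non-string cells (for example NaN, which is a float) yield an empty list.
    """
    if not isinstance(value, str):
        return []
    return [word for word in value.lower().split() if word not in stop]

print(remove_stopwords("I love this car"))  # ['love', 'car']
print(remove_stopwords(float("nan")))       # []
```

With a DataFrame this could be applied as test['tweet'].apply(remove_stopwords), assuming the column still holds raw strings rather than the pre-split lists from the question.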

【Solution 2】:

Take a look at pd.DataFrame.replace(); it may work for you:

In [42]: test.replace(to_replace='I', value="",regex=True)
Out[42]:
                              tweet     class
0                     love this car  positive
1              This view is amazing  positive
2           feel great this morning  positive
3   am so excited about the concert  positive
4              He is my best friend  positive

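Edit: replace() searches within the string, including substrings. For example, if rk were a stop word, it would also replace the rk in work, which is sometimes not what you want.

So use a regex with word boundaries here: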

for i in stop :
    test = test.replace(to_replace=r'\b%s\b'%i, value="",regex=True)
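
To illustrate why the word boundaries matter, here is a small sketch that uses only the standard re module. The stop word "rk" is the artificial example from the edit above. re.escape is an addition not present in the original answer; it guards against stop words that contain regex metacharacters.

```python
import re

text = "work on rk today"
stop_word = "rk"  # artificial stop word from the example above

# Plain substring replacement also eats the "rk" inside "work".
naive = text.replace(stop_word, "")

# \b anchors the match to word boundaries, so only the standalone "rk" goes.
bounded = re.sub(r"\b%s\b" % re.escape(stop_word), "", text)

print(repr(naive))    # 'wo on  today'
print(repr(bounded))  # 'work on  today'
```

Both results keep a doubled space where the word was removed, so a whitespace clean-up step is usually still needed afterwards.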

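【Comments】:

Works great! I just wanted to update the answer with more cases.

【Solution 3】:

We can import stopwords from nltk.corpus as shown below. With it, we exclude the stop words using a Python list comprehension and pandas.Series.apply.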

# Import pandas, and the stop word list from nltk.
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]

# Exclude stopwords with Python's list comprehension and pandas.Series.apply.
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test)
# Out[40]:
#                                tweet     class tweet_without_stopwords
# 0                    I love this car  positive              I love car
# 1               This view is amazing  positive       This view amazing
# 2          I feel great this morning  positive    I feel great morning
# 3  I am so excited about the concert  positive       I excited concert
# 4               He is my best friend  positive          He best friend

You can also exclude them with pandas.Series.str.replace.

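pat = r'\b(?:{})\b'.format('|'.join(stop))
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default.
test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)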
# Same results.
# 0              I love car
# 1       This view amazing
# 2    I feel great morning
# 3       I excited concert
# 4          He best friend
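
As a rough illustration of what the pattern r'\b(?:{})\b' expands to, the sketch below builds it from a three-word stand-in list (an assumption; the real nltk list is much longer) and applies it with the standard re module instead of pandas. re.escape and re.IGNORECASE are additions beyond the original answer, and strip_stop_words is a hypothetical helper name.

```python
import re

demo_stop = ["this", "is", "i"]  # hypothetical short stop list

# (?:a|b|c) is a non-capturing group of alternatives; the \b on each side
# restricts matches to whole words.
pat = r"\b(?:{})\b".format("|".join(re.escape(w) for w in demo_stop))
print(pat)  # \b(?:this|is|i)\b

def strip_stop_words(text):
    # Remove whole-word matches regardless of case, then tidy the spacing.
    without = re.sub(pat, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", without).strip()

print(strip_stop_words("This view is amazing"))  # view amazing
print(strip_stop_words("I love this car"))       # love car
```

Unlike the output shown above, capitalised stop words such as "I" and "This" are removed here, because the match ignores case.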

If the stop words cannot be imported, download them as follows.

import nltk
nltk.download('stopwords')

Another approach is to import text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

# Import stopwords with scikit-learn
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

Note that the scikit-learn and nltk stop word lists contain different numbers of words.

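【Comments】:

What does "r'\b(?:{})\b'" do?
What if the DataFrame has more than one column? When I try to apply this to multiple columns, I get a KeyError.
When I try to run this code I get the error AttributeError: Can only use .str accessor with string values!
Almost worked for me, except that I had to wrap x in str(), as in word for word in str(x).split() if word not in stop. I am using pandas 1.1.2 and Python 3.8.5.

【Solution 4】:

If you want something simple that returns a string instead of a list of words: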

test["tweet"].apply(lambda words: ' '.join(word.lower() for word in words.split() if word.lower() not in stop))

stop is defined as in the OP:

from nltk.corpus import stopwords
stop = stopwords.words('english')
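
A possible variant of the same idea, sketched with a small hand-written stop list rather than nltk's (an assumption made so the example runs without downloads). Storing the stop words in a set makes each membership test constant-time on average, and lowercasing before the lookup also catches capitalised words. to_clean_string and demo_stop are hypothetical names.

```python
demo_stop = {"i", "am", "so", "about", "the", "this", "is", "he", "my"}

def to_clean_string(text, stop=frozenset(demo_stop)):
    # Lowercase each token first, then keep it only if it is not a stop word.
    tokens = (word.lower() for word in text.split())
    return " ".join(word for word in tokens if word not in stop)

print(to_clean_string("I am so excited about the concert"))  # excited concert
```

With the real list, set(stopwords.words('english')) could be passed as the stop argument.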

【Comments】: