【Question Title】: Remove stop words from a word frequency table
【Posted】: 2020-06-13 08:40:01
【Question】:

I am trying to remove stop words from this data:

    DateTime      Clean       Count
    2020-01-07    then           28
                  and            28
                  pizza          14
                  capital        14
    ...
    2020-03-31    college        14
                  included       14
                  of             14
    ...

The data comes from:

df4.groupby('DateTime').agg({'Clean': 'value_counts'}).rename(columns={'Clean': 'Count'}).groupby('DateTime').head(4)

How can I remove the stop words from this frequency list?

A sample of the data before grouping (the raw data):

Text                                             Clean
all information regarding the state of art ...  [all, information, regarding, the, state, of, art, ...
all information regarding the state of art ...  [all, information, regarding, the, state, of, art, ...
to get a good result you should ...             [to, get, a, good, ...

The first column is the text I need to tokenize; Clean holds the tokens of each text. I need the word frequencies per DateTime, as shown above, but with the stop words excluded.
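One way to drop stop words from an already-built frequency table like the one above (not part of the original question; the stop-word set here is a small stand-in for nltk's `stopwords.words('english')`) is to mask the index level holding the words:

```python
import pandas as pd

# stand-in stop-word set; in practice use nltk.corpus.stopwords.words('english')
stop = {'then', 'and', 'of', 'the'}

# frequency table shaped like the one above: (DateTime, word) -> Count
freq = pd.DataFrame(
    {'Count': [28, 28, 14, 14]},
    index=pd.MultiIndex.from_tuples(
        [('2020-01-07', 'then'),
         ('2020-01-07', 'and'),
         ('2020-01-07', 'pizza'),
         ('2020-01-07', 'capital')],
        names=['DateTime', 'Clean']),
)

# keep only the rows whose word is not a stop word
freq = freq[~freq.index.get_level_values('Clean').isin(stop)]
```

The advantage of masking is that the existing groupby pipeline does not need to change; the filter runs on its output.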

【Comments】:

  • Does this answer your question? Python remove stop words from pandas dataframe
  • I did `remove_words = list(stopwords.words('english')) + list(more_stop)` and then `df4.Clean = df4.Clean.apply(lambda x: list(word for word in x.split() if word not in remove_words))`

Tags: python pandas


【Solution 1】:
  • Use the stop words from nltk
    • They are loaded as a list
  • Update the nltk collections with `import nltk` followed by `nltk.download()`
import pandas as pd
from nltk.corpus import stopwords

# stop words list
stop = stopwords.words('english')

# data and dataframe
data = {'Text': ['all information regarding the state of art',
                 'all information regarding the state of art',
                 'to get a good result you should'],
        'DateTime': ['2020-01-07', '2020-02-04', '2020-03-06']}

df = pd.DataFrame(data)

# all strings to lowercase, strip whitespace from the ends, and split on space
df.Text = df.Text.str.lower().str.strip().str.split()

# remove stop words from Text
df['Clean'] = df.Text.apply(lambda x: [w.strip() for w in x if w.strip() not in stop])

# explode lists
df = df.explode('Clean')

# groupby DateTime and Clean
dfg = df.groupby(['DateTime', 'Clean']).agg({'Clean': 'count'})

                        Clean
DateTime   Clean             
2020-01-07 art              1
           information      1
           regarding        1
           state            1
2020-02-04 art              1
           information      1
           regarding        1
           state            1
2020-03-06 get              1
           good             1
           result           1
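If, as in the question, `Clean` already holds token lists, the same stop-word filter can be applied to the lists themselves before grouping; a minimal sketch (again with a small stand-in stop set instead of nltk's full English list):

```python
import pandas as pd

# stand-in stop-word set; replace with nltk's stopwords.words('english')
stop = {'then', 'and', 'of', 'the', 'a', 'to'}

# data shaped like the question's: Clean is already tokenized
df4 = pd.DataFrame({
    'DateTime': ['2020-01-07', '2020-01-07'],
    'Clean': [['then', 'and', 'pizza', 'capital'],
              ['of', 'college', 'pizza', 'included']],
})

# drop stop words from each token list
df4['Clean'] = df4['Clean'].apply(lambda toks: [w for w in toks if w not in stop])

# one row per token, then per-date word frequencies (top 4 per date)
counts = (df4.explode('Clean')
             .groupby('DateTime')['Clean']
             .value_counts()
             .groupby('DateTime')
             .head(4))
```

This matches the question's original pipeline (`value_counts` per `DateTime`, then `head(4)`), just with the filter applied one step earlier.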

【Discussion】:
