【Question Title】: Remove stop words from a word frequency table
【Posted】: 2020-06-13 08:40:01
【Question】:

I am trying to remove stop words from this data:

    DateTime      Clean       Count
    2020-01-07    then           28
                  and            28
                  pizza          14
                  capital        14
    ...
    2020-03-31    college        14
                  included       14
                  of             14
    ...

The data comes from:

df4.groupby('DateTime').agg({'Clean': 'value_counts'}).rename(columns={'Clean': 'Count'}).groupby('DateTime').head(4)

How can I remove the stop words from this frequency list?

A sample of the data before grouping (the raw data):

Text                                             Clean
all information regarding the state of art ...  [all, information, regarding, the, state, of, art, ...
all information regarding the state of art ...  [all, information, regarding, the, state, of, art, ...
to get a good result you should ...             [to, get, a, good, ...

The first column is the text I need to tokenize; Clean holds the tokens of each text. I need the word frequencies per DateTime, as shown above, but with the stop words excluded.
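One way to drop stop words from an already-built frequency table like the one above (not part of the original question; the stop-word set here is a small stand-in for nltk's `stopwords.words('english')`) is to mask the index level holding the words:

```python
import pandas as pd

# stand-in stop-word set; in practice use nltk.corpus.stopwords.words('english')
stop = {'then', 'and', 'of', 'the'}

# frequency table shaped like the one above: (DateTime, word) -> Count
freq = pd.DataFrame(
    {'Count': [28, 28, 14, 14]},
    index=pd.MultiIndex.from_tuples(
        [('2020-01-07', 'then'),
         ('2020-01-07', 'and'),
         ('2020-01-07', 'pizza'),
         ('2020-01-07', 'capital')],
        names=['DateTime', 'Clean']),
)

# keep only the rows whose word is not a stop word
freq = freq[~freq.index.get_level_values('Clean').isin(stop)]
```

The advantage of masking is that the existing groupby pipeline does not need to change; the filter runs on its output.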

【Comments】:

  • Does this answer your question? Python remove stop words from pandas dataframe
  • I did `remove_words = list(stopwords.words('english')) + list(more_stop)` and then `df4.Clean = df4.Clean.apply(lambda x: list(word for word in x.split() if word not in remove_words))`

Tags: python pandas


【Solution 1】:
  • Use the stop words from nltk
    • They are loaded as a list
  • Update the nltk collections with `import nltk` followed by `nltk.download()`
import pandas as pd
from nltk.corpus import stopwords

# stop words list
stop = stopwords.words('english')

# data and dataframe
data = {'Text': ['all information regarding the state of art',
                 'all information regarding the state of art',
                 'to get a good result you should'],
        'DateTime': ['2020-01-07', '2020-02-04', '2020-03-06']}

df = pd.DataFrame(data)

# all strings to lowercase, strip whitespace from the ends, and split on space
df.Text = df.Text.str.lower().str.strip().str.split()

# remove stop words from Text
df['Clean'] = df.Text.apply(lambda x: [w.strip() for w in x if w.strip() not in stop])

# explode lists
df = df.explode('Clean')

# groupby DateTime and Clean
dfg = df.groupby(['DateTime', 'Clean']).agg({'Clean': 'count'})

                        Clean
DateTime   Clean             
2020-01-07 art              1
           information      1
           regarding        1
           state            1
2020-02-04 art              1
           information      1
           regarding        1
           state            1
2020-03-06 get              1
           good             1
           result           1
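If, as in the question, `Clean` already holds token lists, the same stop-word filter can be applied to the lists themselves before grouping; a minimal sketch (again with a small stand-in stop set instead of nltk's full English list):

```python
import pandas as pd

# stand-in stop-word set; replace with nltk's stopwords.words('english')
stop = {'then', 'and', 'of', 'the', 'a', 'to'}

# data shaped like the question's: Clean is already tokenized
df4 = pd.DataFrame({
    'DateTime': ['2020-01-07', '2020-01-07'],
    'Clean': [['then', 'and', 'pizza', 'capital'],
              ['of', 'college', 'pizza', 'included']],
})

# drop stop words from each token list
df4['Clean'] = df4['Clean'].apply(lambda toks: [w for w in toks if w not in stop])

# one row per token, then per-date word frequencies (top 4 per date)
counts = (df4.explode('Clean')
             .groupby('DateTime')['Clean']
             .value_counts()
             .groupby('DateTime')
             .head(4))
```

This matches the question's original pipeline (`value_counts` per `DateTime`, then `head(4)`), just with the filter applied one step earlier.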

【Discussion】:
