Remove punctuation and stop words from a data frame: answers

【Question title】: Remove punctuation and stop words from a data frame
【Posted】: 2020-10-04 16:33:44
【Question】:

My data frame looks like this:

State                           text
Delhi                  170 kw for330wp, shipping and billing in delhi...
Gujarat                4kw rooftop setup for home Photovoltaic Solar...
Karnataka              language barrier no requirements 1kw rooftop ...
Madhya Pradesh         Business PartnerDisqualified Mailed questionna...
Maharashtra            Rupdaypur, panskura(r.s) Purba Medinipur 150kw...

I want to remove punctuation and stop words from this data frame. I wrote the following code, but it does not work:

import nltk
nltk.download('stopwords')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
import re

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

df['text'] = df['text'].apply(message_cleaning)

AttributeError: 'set' object has no attribute 'words'

【Comments on the question】:

Tags: python-3.x pandas scikit-learn nltk

【Solution 1】:

Problem: I think you have a name collision on stopwords. Somewhere in your notebook there is probably a line that assigns:

stopwords = stopwords.words("english")

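That would explain the error: after such an assignment, the name stopwords refers to your variable rather than the NLTK module, so stopwords.words(...) fails. Since the message mentions a 'set' object, the reassigned value was presumably wrapped in set(...).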
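The shadowing described above can be reproduced without NLTK. The sketch below is only an illustration: FakeCorpus is a hypothetical stand-in for nltk.corpus.stopwords, and its word list is made up. It shows how rebinding the name to a set leads to the same AttributeError.

```python
class FakeCorpus:
    """Hypothetical stand-in for nltk.corpus.stopwords (not the real module)."""

    def words(self, language):
        # Made-up word list, only here so the example runs offline.
        return ["the", "and", "in"]


stopwords = FakeCorpus()

# Rebinding the name: from here on, `stopwords` is a plain set,
# no longer the object that has a .words() method.
stopwords = set(stopwords.words("english"))

try:
    stopwords.words("english")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

In a notebook the rebinding may sit in a cell that was run earlier, so the offending line is not necessarily visible next to the failing code.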

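Solution: make things explicit.

First, store the stop words in a variable with its own name (this is also faster than calling stopwords.words on every lookup):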

from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))

Then use it in your function:

Test_punc_removed_join_clean = [
    word for word in Test_punc_removed_join.split() 
    if word.lower() not in english_stop_words
]
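Putting the pieces together, a minimal sketch of the whole cleaning function might look like the following. To keep it runnable without downloading the NLTK corpus, english_stop_words here is a small made-up set (an assumption for illustration); in the notebook it would be set(stopwords.words("english")) as shown above. The use of str.translate to strip punctuation is a substitution for the character-by-character list comprehension in the question, not something the original answer proposes.

```python
import string

# Stand-in stop-word set so the sketch runs offline (assumption).
# In the real notebook: english_stop_words = set(stopwords.words("english"))
english_stop_words = {"for", "and", "in", "no", "the", "a"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def message_cleaning(message, stop_words=english_stop_words):
    """Strip punctuation, then drop stop words; returns a list of tokens."""
    no_punct = message.translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]


cleaned = message_cleaning("170 kw for330wp, shipping and billing in delhi...")
print(cleaned)
```

With a data frame, the function would be applied the same way as in the question: df['text'] = df['text'].apply(message_cleaning). Note that it returns a list of tokens per row; joining them with ' '.join(...) would give a string instead.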

【Comments】: