【Posted】: 2019-05-18 13:04:38
【Question】:
I have a dataset with 1 million records, as shown below.
Sample DF1:
articles_urlToImage  feed_status  status  keyword
hhtps://rqqkf.com    untagged     tag     the apple,a mobile phone
hhtps://hqkf.com     tagged       ingore  blackberry, the a phone
hhtps://hqkf.com     untagged     tag     amazon, an shopping site
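Now I want to remove the standard stop words plus some custom stop words from the keyword column, as shown below.
Custom stop words = ['phone','site'] (I have about 35 custom stop words)
Expected output: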
articles_urlToImage  feed_status  status  keyword
hhtps://rqqkf.com    untagged     tag     apple,mobile
hhtps://hqkf.com     tagged       ingore  blackberry
hhtps://hqkf.com     untagged     tag     amazon,shopping
I tried to remove the stop words, but I get the error below.
Code:
import nltk
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
df1['keyword'] = df1['keyword'].apply(lambda x: [item for item in x if item not in stop])
Error:
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
3612 if name in self._info_axis:
3613 return self[name]
-> 3614 return object.__getattribute__(self, name)
3615
3616 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'split'
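One way the expected output could be produced is sketched below. Two observations motivate it. First, the lambda in the posted code iterates over each keyword string character by character, so it never compares whole words against the stop list. Second, the traceback mentions `split`, which the posted code does not call, so presumably an earlier attempt called `.split` on the whole Series instead of on each element. The sketch splits each string on commas, drops stop words inside each phrase, and rejoins the rest. The function name `clean_keywords` and the tiny `ENGLISH_STOPWORDS` set are assumptions for illustration; the set is only a stand-in for `stopwords.words('english')` so the example runs without NLTK data.

```python
# Stand-in for nltk's stopwords.words('english'); an illustrative subset only.
ENGLISH_STOPWORDS = {"a", "an", "the"}
# Custom stop words from the question (the asker has about 35 in total).
CUSTOM_STOPWORDS = {"phone", "site"}
# A set gives constant-time membership tests, which matters for 1 million rows.
STOP = ENGLISH_STOPWORDS | CUSTOM_STOPWORDS


def clean_keywords(text, stop=STOP):
    """Remove stop words from a comma-separated keyword string."""
    kept = []
    for phrase in text.split(","):
        # Split the phrase into words and keep only the non-stop words.
        words = [w for w in phrase.split() if w.lower() not in stop]
        if words:  # skip phrases that consisted only of stop words
            kept.append(" ".join(words))
    return ",".join(kept)


samples = [
    "the apple,a mobile phone",
    "blackberry, the a phone",
    "amazon, an shopping site",
]
cleaned = [clean_keywords(s) for s in samples]
print(cleaned)  # ['apple,mobile', 'blackberry', 'amazon,shopping']
```

With pandas, the same function could be applied per row with `df1['keyword'] = df1['keyword'].astype(str).apply(clean_keywords)`, and the stand-in set could be replaced by `set(stopwords.words('english'))` once the NLTK stopwords resource has been downloaded.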
【Discussion】:
-
I am getting this error:
LookupError:
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/nltk_data'
    - '/usr/lib/nltk_data'
**********************************************************************
-
A Google search is just one click away, isn't it?