【Posted】: 2020-08-13 17:15:52
【Question】:
The libraries I am using are:
import pandas as pd
import string
from nltk.corpus import stopwords
import nltk
I have the following dataframe:
df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
                            'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
                            'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
                            'The process of metamorphism does not melt the rocks.'],
                   'Class': ['biology', 'biology', 'geography', 'geography']})
print(df)
Send Class
Golgi body, membrane-bound organelle of eukary... biology
The Golgi apparatus is responsible for transpo... biology
Non-foliated metamorphic rocks do not have a p... geography
The process of metamorphism does not melt the ... geography
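I want to write a function that cleans the data in the 'Send' column. I want to:
- remove punctuation;
- remove stopwords;
- return a new dataframe whose 'Send' column contains the cleaned words.
I tried to write the following function: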
def Text_Process(mess):
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
However, the result it returns is not what I want. When I run:
Text_Process(df['Send'])
The output is:
['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible', 'transporting,',
'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
'melt', 'rocks.']
I would like the output to be a dataframe with the modified 'Send' column:
df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells clearly defined nuclei',
                            'Golgi apparatus responsible transporting modifying packaging proteins',
                            'Non foliated metamorphic rocks platy sheet like structure',
                            'process metamorphism melt rocks'],
                   'Class': ['biology', 'biology', 'geography', 'geography']})
In other words, the result should be a dataframe whose 'Send' column is clean (no punctuation and no irrelevant words).
Thank you.
【Discussion】:
Tags: python text nltk stop-words mining
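One possible answer, sketched under assumptions: iterating over `df['Send']` yields whole sentences rather than single characters, so in `Text_Process(df['Send'])` the punctuation test never removes anything and `''.join` glues the four sentences together. A common workaround is to clean one string at a time and apply that function per row. The names `clean_text`, `PUNCT_TO_SPACE` and `DEMO_STOP_WORDS` below are hypothetical, and the small stopword set is only a stand-in for `stopwords.words('english')` so the example runs without the NLTK data.

```python
import string

# Map every punctuation character to a space, so "sheet-like" becomes
# "sheet like" instead of "sheetlike".
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def clean_text(text, stop_words):
    """Clean ONE sentence: replace punctuation with spaces, then drop stopwords."""
    no_punct = text.translate(PUNCT_TO_SPACE)
    kept = [word for word in no_punct.split() if word.lower() not in stop_words]
    return ' '.join(kept)


# Tiny stand-in stopword set (an assumption for this sketch). With NLTK
# installed you would pass set(stopwords.words('english')) instead.
DEMO_STOP_WORDS = {'the', 'of', 'does', 'not', 'do', 'have', 'a', 'or',
                   'is', 'for', 'and', 'with'}

sentences = [
    'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
    'The process of metamorphism does not melt the rocks.',
]

# Clean each sentence separately rather than passing the whole collection.
cleaned = [clean_text(sentence, DEMO_STOP_WORDS) for sentence in sentences]
print(cleaned)
```

With the dataframe from the question, something like `stop_words = set(stopwords.words('english'))` followed by `df['Send'] = df['Send'].apply(lambda s: clean_text(s, stop_words))` should rewrite the column row by row; the exact words kept depend on NLTK's stopword list. Building the set once also avoids calling `stopwords.words('english')` again for every word.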