How to remove punctuation and irrelevant words with stopwords (Text Mining): answers

【Question title】: How to remove punctuation and irrelevant words with stopwords (Text Mining)
【Posted】: 2020-08-13 17:15:52
【Question description】:

The libraries I am using are:

      import pandas as pd
      import string
      from nltk.corpus import stopwords
      import nltk

I have the following DataFrame:

     df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
                                 'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
                                 'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
                                 'The process of metamorphism does not melt the rocks.'],
                        'Class': ['biology', 'biology', 'geography', 'geography']})

     print(df)

                              Send                           Class
         Golgi body, membrane-bound organelle of eukary...  biology
         The Golgi apparatus is responsible for transpo...  biology
         Non-foliated metamorphic rocks do not have a p...  geography
         The process of metamorphism does not melt the ...  geography

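I want to write a function that cleans the data in the 'Send' column. It should:

Remove punctuation;
Remove stop words (the NLTK 'stopwords');
Return a new DataFrame whose 'Send' column contains the "clean words".

I tried writing the following function: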

      def Text_Process(mess): 
           nopunc = [char for char in mess if char not in string.punctuation]
           nopunc = ''.join(nopunc)  
           return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

However, the result is not what I want. When I run:

        Text_Process(df['Send'])

The output is:

       ['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
        'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible',  'transporting,', 
        'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
        'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
        'melt', 'rocks.']

I would like the output to be a DataFrame with the modified 'Send' column:

       df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells clearly defined nuclei',
                                   'Golgi apparatus responsible transporting modifying packaging proteins',
                                   'Non foliated metamorphic rocks platy sheet like structure',
                                   'process metamorphism melt rocks'],
                          'Class': ['biology', 'biology', 'geography', 'geography']})

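In other words, the output should be a DataFrame whose 'Send' column is clean (no punctuation and no irrelevant words).

Thank you.

【Question discussion】:

Tags: python text nltk stop-words mining

【Solution 1】:

Here is a script that cleans the column. Note that you may want to add more words to the stop word set to suit your requirements.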

import pandas as pd
import string
import re
from nltk.corpus import stopwords

df = pd.DataFrame(
    {'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
              'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
              'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
              'The process of metamorphism does not melt the rocks.'],
     'Class': ['biology', 'biology', 'geography', 'geography']})

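# Translation table that deletes every character in string.punctuation.
table = str.maketrans('', '', string.punctuation)

# Build the stop word set once instead of reloading the list for every word.
# The corpus must be available: run nltk.download('stopwords') once.
stop_words = set(stopwords.words('english'))

def text_process(mess):
    # Split on runs of non-word characters, so "membrane-bound" becomes two words.
    words = re.split(r'\W+', mess)
    # Remove what the split leaves behind (underscores count as word characters).
    nopunc = [w.translate(table) for w in words]
    # Drop empty strings and stop words, then rejoin with single spaces.
    return ' '.join(w for w in nopunc if w and w.lower() not in stop_words)

# Clean each value of the 'Send' column.
df['Send'] = df['Send'].apply(text_process)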

print(df)

Output:

                                                                                 Send      Class
0  Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei     biology
1               Golgi apparatus responsible transporting modifying packaging proteins    biology
2                          Non foliated metamorphic rocks platy sheet like structure   geography
3                                                    process metamorphism melt rocks   geography
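
The note about adding more words to the stop word set could look roughly like the sketch below. It is only an illustration under stated assumptions: `BASE_STOPWORDS` is a tiny hand-written stand-in for `stopwords.words('english')` so that the snippet runs without the NLTK corpus, and `EXTRA_STOPWORDS` and `clean_text` are hypothetical names that do not appear in the answer above.

```python
import re

# Assumption: a small hand-written stand-in for stopwords.words('english'),
# used so this sketch runs without downloading the NLTK corpus.
BASE_STOPWORDS = {"the", "of", "is", "for", "and", "with", "do", "not",
                  "have", "a", "or", "does"}

# Hypothetical domain-specific additions; pick whatever is irrelevant for your data.
EXTRA_STOPWORDS = {"body", "process"}

# Union of both sets; membership tests on a set are fast.
STOP_WORDS = BASE_STOPWORDS | EXTRA_STOPWORDS


def clean_text(text, stop_words=STOP_WORDS):
    """Split on non-word characters, then drop empty tokens and stop words."""
    tokens = re.split(r"\W+", text)
    return " ".join(t for t in tokens if t and t.lower() not in stop_words)


sentences = [
    "Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).",
    "The process of metamorphism does not melt the rocks.",
]
cleaned = [clean_text(s) for s in sentences]
print(cleaned)
```

With the real corpus, the base set would presumably be `set(stopwords.words('english'))`, and the function could be applied to the column with `df['Send'] = df['Send'].apply(clean_text)`.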

【Discussion】:
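
A likely reason the original `Text_Process(df['Send'])` call glued sentences together ('nuclei).The', 'proteinsNon-foliated') is that the function received the whole column rather than one string. Iterating over a column yields whole sentences, not characters, so no punctuation is filtered and `''.join` concatenates the sentences with nothing in between. The sketch below reproduces this with a plain list standing in for `df['Send']` (an assumption made so that it runs without pandas); `strip_punctuation` is a hypothetical helper name.

```python
import string

# Stand-in for df['Send']: iterating over a pandas Series also yields its values one by one.
column = ["Golgi body, membrane-bound organelle.", "The Golgi apparatus"]

# What the original function does when it is handed the whole column:
# each item is a full sentence, which is never a substring of string.punctuation,
# so every sentence is kept unchanged.
kept = [item for item in column if item not in string.punctuation]
joined = "".join(kept)  # sentences are concatenated without a separator
print(joined)


def strip_punctuation(text):
    """Remove punctuation characters from a single string."""
    return "".join(ch for ch in text if ch not in string.punctuation)


# Handling one string at a time iterates over characters, as intended.
per_row = [strip_punctuation(s) for s in column]
print(per_row)
```

Note that deleting the hyphen turns "membrane-bound" into "membranebound", which is presumably why the solution splits on `\W+` instead of only deleting punctuation.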