【Question title】: How to remove punctuation and irrelevant words with stopwords (Text Mining)
【Posted】: 2020-08-13 17:15:52
【Question description】:

The libraries I am using are:

      import pandas as pd
      import string
      from nltk.corpus import stopwords
      import nltk

I have the following dataframe:

     df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells '
                                 'with clearly defined nuclei).',
                                 'The Golgi apparatus is responsible for transporting, modifying, and '
                                 'packaging proteins',
                                 'Non-foliated metamorphic rocks do not have a platy or sheet-like '
                                 'structure.',
                                 'The process of metamorphism does not melt the rocks.'],
                        'Class': ['biology', 'biology', 'geography', 'geography']})

     print(df)

                              Send                           Class
         Golgi body, membrane-bound organelle of eukary...  biology
         The Golgi apparatus is responsible for transpo...  biology
         Non-foliated metamorphic rocks do not have a p...  geography
         The process of metamorphism does not melt the ...  geography

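I want to write a function that cleans the data in the 'Send' column. I want to:

  1. Remove punctuation;
  2. Remove stopwords;
  3. Return a new dataframe whose 'Send' column contains the "clean words".

I tried to develop the following function: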

      def Text_Process(mess): 
           nopunc = [char for char in mess if char not in string.punctuation]
           nopunc = ''.join(nopunc)  
           return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

However, the return value is not what I want. When I run:

        Text_Process(df['Send'])

The output is:

       ['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
        'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible',  'transporting,', 
        'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
        'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
        'melt', 'rocks.']

I would like the output to be a dataframe with the modified 'Send' column:

       df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells '
                                   'clearly defined nuclei',
                                   'Golgi apparatus responsible transporting modifying '
                                   'packaging proteins',
                                   'Non foliated metamorphic rocks platy sheet like '
                                   'structure',
                                   'process metamorphism melt rocks'],
                          'Class': ['biology', 'biology', 'geography', 'geography']})

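In short, I want the output to be a dataframe whose 'Send' column is clean (no punctuation and no irrelevant words).

Thank you.

【Question comments】:

    Tags: python text nltk stop-words mining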


    【Solution 1】:

    Here is a script that cleans the column. Note that you may want to add more words to the stopword set to suit your requirements.

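    import pandas as pd
    import string
    import re
    from nltk.corpus import stopwords
    # Requires the NLTK stopword corpus: run nltk.download('stopwords') once.
    
    df = pd.DataFrame(
        {'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
                  'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
                  'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
                  'The process of metamorphism does not melt the rocks.'],
         'Class': ['biology', 'biology', 'geography', 'geography']})
    
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans('', '', string.punctuation)
    
    # Build the stopword set once; calling stopwords.words() for every word is slow.
    stop_words = set(stopwords.words('english'))
    
    def text_process(mess):
        # Split on runs of non-word characters, so 'membrane-bound' becomes two words.
        words = re.split(r'\W+', mess)
        # Remove any punctuation still inside a token (e.g. underscores).
        nopunc = [w.translate(table) for w in words]
        # Drop stopwords (case-insensitive) and rejoin the rest into one string.
        return ' '.join(word for word in nopunc if word.lower() not in stop_words)
    
    # Apply the cleaner to each string in the 'Send' column.
    df['Send'] = df['Send'].apply(text_process)
    
    print(df)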
    

    Output:

                                                                                     Send      Class
    0  Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei     biology
    1               Golgi apparatus responsible transporting modifying packaging proteins    biology
    2                          Non foliated metamorphic rocks platy sheet like structure   geography
    3                                                    process metamorphism melt rocks   geography
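
    As a side note on why the original `Text_Process(df['Send'])` call glued the sentences together: iterating over a Series yields whole strings rather than characters, so the punctuation filter never sees a single character. The sketch below reproduces this with a plain list standing in for the column (an assumption made here so the example runs without pandas), and contrasts it with cleaning one string at a time. The helper name `strip_punctuation` is hypothetical.

```python
import string

# Stand-in for df['Send']: iterating a pandas Series yields its elements
# (whole strings), just as iterating this list does.
sentences = [
    'The process of metamorphism does not melt the rocks.',
    'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
]

# What Text_Process effectively did: each "char" is a whole sentence, and
# `sentence not in string.punctuation` is a substring test that is True here,
# so nothing is filtered out and the sentences are joined with no separator.
kept = [char for char in sentences if char not in string.punctuation]
joined = ''.join(kept)

def strip_punctuation(text):
    """Remove punctuation from a single string, character by character."""
    return ''.join(ch for ch in text if ch not in string.punctuation)

# Cleaning one string at a time gives the intended per-row result.
cleaned = [strip_punctuation(s) for s in sentences]
```

    Deleting punctuation character by character would also turn 'membrane-bound' into 'membranebound', which is presumably why the script above splits on `\W+` instead.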
    

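    The remark about adding more words to the stopword set could look roughly like the sketch below. It also drops the empty tokens that `re.split(r'\W+', ...)` produces when a string starts or ends with punctuation, which otherwise leave a stray space in the result. `BASE_STOPWORDS` is a small hand-written stand-in for `set(stopwords.words('english'))` so the example runs without the NLTK corpus, and `EXTRA_STOPWORDS` and `clean_text` are hypothetical names, not part of the answer above.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')); swap in the
# NLTK list when the corpus is installed.
BASE_STOPWORDS = {'the', 'of', 'is', 'for', 'and', 'do', 'not', 'have', 'a',
                  'or', 'does', 'with'}

# Hypothetical domain-specific additions.
EXTRA_STOPWORDS = {'body'}

STOP_WORDS = BASE_STOPWORDS | EXTRA_STOPWORDS

def clean_text(text, stop_words=STOP_WORDS):
    """Split on non-word characters, then drop empty tokens and stopwords."""
    tokens = re.split(r'\W+', text)
    # `t and ...` skips the empty strings produced at the edges of the split.
    kept = [t for t in tokens if t and t.lower() not in stop_words]
    return ' '.join(kept)

result = clean_text('Golgi body, membrane-bound organelle of eukaryotic cells '
                    '(cells with clearly defined nuclei).')
```

    With a dataframe, the same function could be applied as `df['Send'] = df['Send'].apply(clean_text)`.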
    【Comments】:
