【Question】: How to extend the stopword list from NLTK and remove stop words with the extended list?
【Posted】: 2015-05-30 06:27:09
【Description】:

I tried two approaches to removing stop words, and ran into a problem with each:

Method 1:

cachedStopWords = stopwords.words("english")
words_to_remove = """with some your just have from it's /via & that they your there this into providing would can't"""
remove = tu.removal_set(words_to_remove, query)
remove2 = tu.removal_set(cachedStopWords, query)

In this case only the first removal works; remove2 has no effect.

Method 2:

lines = tu.lines_cleanup([sentence for sentence in sentence_list], remove=remove)
words = '\n'.join(lines).split()
print words # list of words

The output looks like this: ["Hello", "Good", "day"]

I then tried to remove the stop words from words. Here is my code:

for word in words:
    if word in cachedStopwords:
        continue
    else:
        new_words='\n'.join(word)

print new_words

The output looks like this:

H
e
l
l
o

I can't figure out what is wrong with either approach. Please advise.

【Comments】:

    Tags: python nlp nltk stop-words


    【Solution 1】:

    Use this to extend the stopword list:

    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print(len(stop_words))
    stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
    print(len(stop_words))
    

    Output:

    179

    184
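The extended list can then be used to filter text directly. A minimal Python 3 sketch, with a hard-coded subset of the NLTK list standing in for the real corpus so the example runs without a download:

```python
# A small subset of NLTK's English stopwords (an assumption made here so
# the example runs without the corpus download), then extended:
stop_words = ['i', 'me', 'my', 'the', 'a', 'is', 'to', 'from']
stop_words.extend(['subject', 're', 'edu', 'use'])

sentence = "re the subject i want to use this edu corpus"
# Keep only tokens that are not in the extended list
filtered = [w for w in sentence.split() if w not in stop_words]
print(filtered)  # ['want', 'this', 'corpus']
```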

    【Discussion】:

      【Solution 2】:

      I think what you want is to extend NLTK's stopword list. Since the NLTK stopwords are kept in a plain list, you can simply do this:

      >>> from nltk.corpus import stopwords
      >>> stoplist = stopwords.words('english')
      >>> stoplist
      [u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
      >>> more_stopwords = """with some your just have from it's /via & that they your there this into providing would can't"""
      >>> stoplist += more_stopwords.split()
      >>> sent = "With some of hacks to your line of code , we can simply extract the data you need ."
      >>> sent_with_no_stopwords = [word for word in sent.split() if word not in stoplist]
      >>> sent_with_no_stopwords
      ['With', 'hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
      # Note that the "With" is different from "with".
      # So let's try this:
      >>> sent_with_no_stopwords = [word for word in sent.lower().split() if word not in stoplist]
      >>> sent_with_no_stopwords
      ['hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
      # To get it back into a string:
      >>> new_sent = " ".join(sent_with_no_stopwords)
      >>> new_sent
      'hacks line code , simply extract data need .'
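One caveat: `word not in stoplist` scans the whole list for every token. Converting the stoplist to a set makes each lookup constant-time. A sketch of the same filter (the stoplist below is a hand-picked subset standing in for the full NLTK list):

```python
# Hand-picked stoplist standing in for the full NLTK list (assumption),
# stored as a set so each membership test is O(1) instead of O(len(list))
stoplist = set("""with some your just have from it's that they there
this into would can't of to the we can you""".split())

sent = "With some of hacks to your line of code , we can simply extract the data you need ."
filtered = [w for w in sent.lower().split() if w not in stoplist]
print(" ".join(filtered))  # hacks line code , simply extract data need .
```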
      

      【Discussion】:

        【Solution 3】:

        You can change Method 2. Note that new_words='\n'.join(word) joins the characters of a single word with newlines, which is why your output prints one letter per line. Change:

        for word in words:
            if word in cachedStopwords:
                continue
            else:
                new_words='\n'.join(word)
        
        print new_words
        

        to:

        new_words = []
        for word in words:
            if word in stop_words:
                continue
            else:
                new_words.append(word)
        
        print new_words
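Run against the sample output from the question, the corrected loop keeps whole words instead of single letters. A sketch with an illustrative one-word stoplist (the lower-casing is an extra tweak so "Good" matches "good"):

```python
stop_words = ['good']            # illustrative one-word stoplist (assumption)
words = ["Hello", "Good", "day"]

new_words = []
for word in words:
    # lower-case before the check so "Good" matches "good"
    if word.lower() in stop_words:
        continue
    new_words.append(word)

print(new_words)  # ['Hello', 'day']
```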
        

        【Discussion】:

          【Solution 4】:

          You need to tokenise your string:

          words = string.split()
          

          That is the simple way, although NLTK provides other tokenizers.

          Then use a list comprehension:

          words = [w for w in words if w not in cachedstopwords]
          

          Putting it together:

          from nltk.corpus import stopwords
          
          stop_words = stopwords.words("english")
          sentence = "You'll want to tokenise your string"
          
          words = sentence.split()
          print words
          words = [w for w in words if w not in stop_words]
          print words
          

          This prints:

          ["You'll", 'want', 'to', 'tokenise', 'your', 'string']
          ["You'll", 'want', 'tokenise', 'string']
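Note that str.split() leaves punctuation attached to words ('string,' would not match 'string' in the stoplist). A stdlib-only sketch that strips surrounding punctuation before the stopword check (NLTK's tokenizers are the more robust option; the stoplist here is an illustrative subset):

```python
import string

stop_words = ['to', 'your', 'the']   # illustrative subset (assumption)
sentence = "You'll want to tokenise your string, really."

# Strip leading/trailing punctuation from each token; internal
# apostrophes (You'll) are untouched by str.strip
words = [w.strip(string.punctuation) for w in sentence.split()]
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ["You'll", 'want', 'tokenise', 'string', 'really']
```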
          

            【Discussion】:
