How to produce bigrams without stop words: answers

【Question title】: How to produce bigrams without stop words
【Posted】: 2017-02-13 15:30:50
【Question description】:

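I wrote this function to generate bigrams from a string with nltk.bigrams while ignoring stop words and letters, but the stop words and letters still show up in the output. Please help me correct the function.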

       def bigramReturner (tweetString, stopWords):
           bigramFeatureVector = []
           tweetStringG = tweetString.lower()
           tweetStringG = tweetString.split()
           for i in tweetStringG:
               i =replaceTwoOrMore(i)
               i =i.strip('\'"?,.')
               val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$", i)
               if(i in stopWords  is None):
                   continue
               else:
                  for i in nltk.bigrams(tweetStringG):
                        bigramFeatureVector.append(' '.join(i))
           return bigramFeatureVector

【Question comments】:

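if(i in stopWords is None) --> the usual form is if i not in stopWords. Also, it looks to me as though you are explicitly matching stop words rather than excluding them? (if not in stop words): Y, (else:) append to bigramFeatureVector. Or am I misreading it?
I don't think (i in stopWords is None) is a sensible expression. False != None. Even setting that aside, you are checking the wrong case: if the word is in stopWords you want to continue, and if it is not, you don't.
I have a similar function that converts a tweet into a set of tokens, and it works.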

Tags: python nltk sentiment-analysis

【Solution 1】:

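Try removing the is None check. As written, Python reads i in stopWords is None as a chained comparison, (i in stopWords) and (stopWords is None), so the condition is never true for a real stop list and no word is ever skipped: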

   def bigramReturner (tweetString, stopWords):
       bigramFeatureVector = []
       tweetStringG = tweetString.lower()
       tweetStringG = tweetString.split()
       for i in tweetStringG:
           i =replaceTwoOrMore(i)
           i =i.strip('\'"?,.')
           val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$", i)
           if(i in stopWords):
               continue
           else:
              for i in nltk.bigrams(tweetStringG):
                    bigramFeatureVector.append(' '.join(i))
       return bigramFeatureVector
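
To see why the original condition never fired, here is a minimal sketch. The stop list and the word are made up for illustration; only the comparison behaviour is the point.

```python
stop_words = {"the", "a", "is"}  # hypothetical stop list, for illustration only
word = "the"

# Python chains comparison operators, so this is evaluated as
# (word in stop_words) and (stop_words is None).
chained = word in stop_words is None

# Adding parentheses compares a bool with None, which is never true either.
parenthesised = (word in stop_words) is None

# The plain membership test is what the filter actually needs.
plain = word in stop_words

print(chained, parenthesised, plain)  # False False True
```

Either spelling of the is None check evaluates to False, so the continue branch was unreachable and every token fell through to the else branch.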

【Comments】:

It still doesn't work. I think the problem is in nltk.bigrams.
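
The leftover stop words are probably explained less by nltk.bigrams itself than by what is passed to it. The loop cleans and tests i, but the bigrams are then built from the unfiltered tweetStringG, once for every token that is kept. In addition, the second assignment splits tweetString rather than the lowercased copy, and val is computed but never checked. Below is a minimal sketch of one way to restructure the function, under some assumptions: replace_two_or_more is a guessed stand-in for the replaceTwoOrMore helper, which the question does not show, and the zip fallback is there only so the sketch runs without NLTK installed.

```python
import re

try:
    from nltk import bigrams  # yields consecutive pairs from a sequence
except ImportError:
    # Fallback so the sketch runs without NLTK; same pairs for a list.
    def bigrams(tokens):
        return zip(tokens, tokens[1:])

# Pattern copied from the question: starts with a letter and contains
# at least one more letter, so single characters are rejected.
WORD_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$")


def replace_two_or_more(word):
    """Assumed stand-in for the question's replaceTwoOrMore helper:
    collapse any run of a repeated character down to two."""
    return re.sub(r"(.)\1+", r"\1\1", word)


def bigram_returner(tweet_string, stop_words):
    tokens = []
    # Split the lowercased text; the original split the raw string.
    for word in tweet_string.lower().split():
        word = replace_two_or_more(word)
        word = word.strip("'\"?,.")
        # Skip stop words and anything the pattern rejects.
        if word in stop_words or WORD_RE.search(word) is None:
            continue
        tokens.append(word)
    # Build the bigrams once, from the filtered tokens only.
    return [" ".join(pair) for pair in bigrams(tokens)]


result = bigram_returner("This is a sooooo good day, I think.",
                         {"this", "is", "a", "i"})
print(result)  # ['soo good', 'good day', 'day think']
```

Filtering before pairing means a bigram such as "day think" joins words that were not adjacent in the tweet. If only originally adjacent pairs are wanted, an alternative is to build the bigrams from all tokens first and then drop any pair that contains a stop word.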