【Question title】: Why is the number of stems output by the NLTK stemmer different from the expected output?
【Posted】: 2020-06-28 18:51:51
【Question】:

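I have to perform stemming on a text. The task is as follows:

  1. Tokenize all the words given in tc. A word should contain letters, digits, or underscores. Store the tokenized list of words in tw.
  2. Convert all the words to lowercase. Store the result in the variable tw.
  3. Remove all the stop words from the unique set of tw. Store the result in the variable fw.
  4. Stem each word in fw with PorterStemmer and store the result in the list psw.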
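To make steps 1 and 2 concrete, here is a minimal sketch that uses the standard library's `re.findall` as a stand-in for `nltk.regexp_tokenize` (an assumption: for a pattern without capturing groups, such as `\w+`, both should return the same tokens). The sample phrase is assembled from words in the failing test case. Note that apostrophes and hyphens split words, and that `Candy` and `candy` become identical only after lowercasing.

```python
import re

# Sample phrase assembled from words in the failing test case.
sample = "See's Candy: dollar-per-candy"

# Step 1: keep runs of letters, digits, and underscores.
tw = re.findall(r"\w+", sample)

# Step 2: lowercase every token.
tw_lower = [word.lower() for word in tw]

print(tw)        # ['See', 's', 'Candy', 'dollar', 'per', 'candy']
print(tw_lower)  # ['see', 's', 'candy', 'dollar', 'per', 'candy']
```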

Here is my code:

import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer

pattern = r'\w+'
tw = nltk.regexp_tokenize(tc, pattern)
tw = [word.lower() for word in tw]
stop_word = set(stopwords.words('english'))
fw = [w for w in tw if not w in stop_word]
#print(sorted(filteredwords))
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]
print(sorted(psw))

My code works perfectly with all the provided test cases, but it fails on the following one:

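tc = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."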

My output is:

['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']

The expected output is:

['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']

The difference is in the number of occurrences of 'candi': my output has two, the expected output has three.

I am looking for help to resolve this.

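【Comments】:

  • What code produced the "expected output"? Is it using the same versions of everything that you are using?
  • The expected output is what is displayed when the code is executed. I cannot see the code that produces it. Yes, I believe the versions are the same, because the hands-on assessment runs in a Web IDE.

Tags: python list nlp nltk stemming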


【Solution 1】:

Try using:

import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer

pattern = r'\w+'
tw = nltk.regexp_tokenize(tc, pattern)
tw = [word.lower() for word in tw]
unique_tw = set(tw)  # Unique set of tokenized words (see your step 3)
stop_word = set(stopwords.words('english'))
fw = [w for w in unique_tw if not w in stop_word]  # Remove stop words from unique_tw
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]
print(sorted(psw))

This is because step 3 says: remove all the stop words from the unique set of tw.

【Comments】:

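    【Solution 2】:

    This happens because the word "Candy" appears in both title case and lowercase:

    I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon.

    【Comments】:

      【Solution 3】: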
         
          import nltk

          # `stem_and_lemmatize` is a hypothetical wrapper name; the snippet was posted as a bare function body.
          def stem_and_lemmatize(textcontent):
              stopword = set(nltk.corpus.stopwords.words('english'))
              pattern = r"\w+"

              tokenizedwords = nltk.regexp_tokenize(textcontent, pattern)

              # Compare against the stop list in lowercase, but keep the original casing.
              filteredwords = [word for word in tokenizedwords if word.lower() not in stopword]

              # set() de-duplicates case-sensitively; lowercasing happens only when stemming.
              porter = nltk.PorterStemmer()
              porterstemmedwords = [porter.stem(word.lower()) for word in set(filteredwords)]

              lancaster = nltk.LancasterStemmer()
              lancasterstemmedwords = [lancaster.stem(word.lower()) for word in set(filteredwords)]

              net_lemmatizer = nltk.WordNetLemmatizer()
              lemmatizedwords = [net_lemmatizer.lemmatize(word.lower()) for word in set(filteredwords)]

              return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
      

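      Please try the code above. Do not convert the words to lowercase at the start; lowercase them only after taking the unique set.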
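The effect of the ordering can be sketched without NLTK. This is an illustration under stated assumptions, not the grader's code: `STOP_WORDS` is a tiny hypothetical stop list standing in for `stopwords.words('english')`, and `toy_stem` is a hypothetical stand-in for `PorterStemmer.stem` that only knows the two words relevant here. The sample sentence is condensed from the test case.

```python
import re

# Hypothetical stand-ins so the sketch runs without NLTK or its corpora.
STOP_WORDS = {"s", "the", "were", "after", "this"}

def toy_stem(word):
    """Stand-in for PorterStemmer.stem; only handles the words used below."""
    return {"candy": "candi", "candies": "candi"}.get(word, word)

text = "See's Candy. The candies were tasty. After this dollar-per-candy revelation."
tokens = re.findall(r"\w+", text)

# Order A: lowercase first, then de-duplicate ("Candy" and "candy" merge).
lower_first = sorted(
    toy_stem(w) for w in {t.lower() for t in tokens} if w not in STOP_WORDS
)

# Order B: de-duplicate first (case-sensitive), then lowercase while stemming.
dedup_first = sorted(
    toy_stem(w.lower()) for w in set(tokens) if w.lower() not in STOP_WORDS
)

print(lower_first.count("candi"))  # 2: from 'candy' and 'candies'
print(dedup_first.count("candi"))  # 3: from 'Candy', 'candy' and 'candies'
```

Order A reproduces the two 'candi' entries in the asker's output, while order B gives the three entries in the expected output, which suggests the expected output was produced by de-duplicating before lowercasing.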

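      【Comments】:

        【Solution 4】:

        First, don't iterate over the text multiple times; see Why is my NLTK function slow when processing the DataFrame?

        Do this instead, so that you iterate over the data/text only once: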

        import re
        
        from nltk import word_tokenize, regexp_tokenize
        from nltk.corpus import stopwords
        from nltk.stem  import PorterStemmer
        
        stop_word = set(stopwords.words('english'))
        porter = PorterStemmer()
        
        text = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."
        
        signature = [porter.stem(word.lower()) 
                     for word in regexp_tokenize(text,r'\w+') 
                     if word.lower() not in stop_word]
        

        Next, let's check it against the expected output:

        # `signature` is the list of stems built above (stop words already removed).
        
        expected = ['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
        
        sorted(signature) == expected  # -> False
        

        [out]:

        False
        

        That's not a good sign. Let's find out which terms are missing:

        # If item in signature but not in expected.
        len(set(signature).difference(expected)) == 0  # -> True
        # If item in expected but not in signature. 
        len(set(expected).difference(signature)) == 0  # -> True
        

        Both sets contain the same terms, so let's check the counts:

        print(len(signature), len(expected))
        

        [out]:

        57 49
        

        Your expected output seems to be missing quite a few items. Check with:

        from collections import Counter
        counter_signature = Counter(signature)
        counter_expected = Counter(expected)
        
        
        for word, count in counter_signature.items():
            # If the count in expected is different.
            expected_count = counter_expected[word]
            if count != expected_count: 
                print(word, count, expected_count)
        

        It seems that candi is not the only stem with a different count!

        [out]:

        see 3 1
        candi 5 3
        dollar 3 1
        two 2 1
        chocol 2 1
        

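        It looks like the signature (i.e. the processed text) contains more occurrences of these stems than the expected output in the question. So most probably the test you were given isn't counting correctly =)

        【Comments】: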
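The same comparison can be written more compactly with `Counter` subtraction, which keeps only positive differences. The sketch below uses abbreviated lists (an assumption for brevity): only the five stems whose counts differ in the output above, plus one matching stem, rather than the full 57 and 49 items.

```python
from collections import Counter

# Abbreviated lists: counts for the differing stems are taken from the output above.
signature = (["candi"] * 5 + ["see"] * 3 + ["dollar"] * 3
             + ["two"] * 2 + ["chocol"] * 2 + ["almond"])
expected = ["candi"] * 3 + ["see", "dollar", "two", "chocol", "almond"]

# Set difference ignores multiplicity, so it reports nothing missing.
print(set(signature) - set(expected))  # set()

# Counter subtraction is a multiset difference: the surplus per stem.
extra = Counter(signature) - Counter(expected)
print(extra["candi"], extra["see"], extra["dollar"], extra["two"], extra["chocol"])  # 2 2 2 1 1
print(sum(extra.values()))  # 8, which equals 57 - 49
```

Because every surplus item is a repeat of a stem that already appears in the expected output, the 8 extra items account for the whole gap between the two lengths.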
