【Question title】: Porter and Lancaster stemming clarification
【Posted】: 2020-06-08 18:06:17
【Question】:

I am stemming with Porter and Lancaster, and I made these observations:

Input: replied
Porter: repli
Lancaster: reply

Input: twice
Porter: twice
Lancaster: twic

Input: came
Porter: came
Lancaster: cam

Input: In
Porter: In
Lancaster: in

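My questions are:

Lancaster is supposed to be the "aggressive" stemmer, yet it handles replied correctly. Why?
With Porter, the word In stays as the capitalized In. Why?
I noticed that Lancaster strips the trailing e from words ending in e. Why?

I can't make sense of these behaviors. Could you help?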

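【Comments】:

Good question, actually!!!
Nice catch on the stemmers' quirky behavior. I've raised an issue about the capitalized output at github.com/nltk/nltk/issues/2507. Thanks again for helping to surface this!

Tags: nlp nltk stemming porter-stemmer nltk-book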

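【Solution 1】:

Q: Lancaster is supposed to be the "aggressive" stemmer, but it works properly with `replied`. Why?

This is because the Lancaster stemmer implementation was improved in https://github.com/nltk/nltk/pull/1654

If we look at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L62, there is a suffix rule that changes -ied > -y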

default_rule_tuple = (
    "ai*2.",   # -ia > -   if intact
    "a*1.",    # -a > -    if intact
    "bb1.",    # -bb > -b
    "city3s.", # -ytic > -ys
    "ci2>",    # -ic > -
    "cn1t>",   # -nc > -nt
    "dd1.",    # -dd > -d
    "dei3y>",  # -ied > -y
    ...)
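
To make the rule notation above concrete, here is a minimal sketch of how a rule string such as "dei3y>" can be read: the leading letters are the ending spelled backwards, `*` means "only if the word is intact", the digit is how many characters to remove, the trailing letters are appended, and `>` or `.` says whether to continue or stop. The helper names `decode_rule` and `apply_rule` are hypothetical and not part of NLTK, and the sketch skips the extra acceptability checks the real stemmer performs before accepting a stem.

```python
import re

# Same shape as the validation regex in lancaster.py, with capture groups added.
RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.]?)$")


def decode_rule(rule):
    """Split a Lancaster rule string into its parts (hypothetical helper)."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError(f"The rule {rule} is invalid")
    reversed_ending, intact_flag, remove_count, append, action = match.groups()
    return {
        "ending": reversed_ending[::-1],   # rules store the ending backwards
        "intact_only": intact_flag == "*",  # apply only to an unstemmed word
        "remove": int(remove_count),        # number of characters to strip
        "append": append,                   # replacement suffix, may be empty
        "continue": action == ">",          # ">" keeps stemming, "." stops
    }


def apply_rule(word, rule):
    """Apply one rule to a word, ignoring the real stemmer's extra checks."""
    parts = decode_rule(rule)
    if not word.endswith(parts["ending"]):
        return word
    return word[: len(word) - parts["remove"]] + parts["append"]


print(apply_rule("replied", "dei3y>"))  # reply
print(apply_rule("came", "e1>"))        # cam
print(apply_rule("twice", "e1>"))       # twic
```

Read this way, "dei3y>" is exactly the -ied > -y rule, which is why `replied` comes out as `reply`.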

The feature allows users to supply new rules; if no additional rules are supplied, parseRules uses self.default_rule_tuple, which is where the rule_tuple is applied https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L196

def parseRules(self, rule_tuple=None):
    """Validate the set of rules used in this stemmer.
    If this function is called as an individual method, without using stem
    method, rule_tuple argument will be compiled into self.rule_dictionary.
    If this function is called within stem, self._rule_tuple will be used.
    """
    # If there is no argument for the function, use class' own rule tuple.
    rule_tuple = rule_tuple if rule_tuple else self._rule_tuple
    valid_rule = re.compile("^[a-z]+\*?\d[a-z]*[>\.]?$")
    # Empty any old rules from the rule set before adding new ones
    self.rule_dictionary = {}

    for rule in rule_tuple:
        if not valid_rule.match(rule):
            raise ValueError("The rule {0} is invalid".format(rule))
        first_letter = rule[0:1]
        if first_letter in self.rule_dictionary:
            self.rule_dictionary[first_letter].append(rule)
        else:
            self.rule_dictionary[first_letter] = [rule]
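
As a rough, standalone illustration of what the quoted method builds, the sketch below groups a handful of rules by their first letter. Because each rule stores its ending backwards, that first letter is the last letter of the ending, so the rules that could match a word can be looked up by the word's final letter. The function names are hypothetical, and the lookup helper is my assumption about how such a dictionary is meant to be used, not a copy of NLTK's code.

```python
import re

# Validation pattern matching the one quoted above (written as a raw string).
VALID_RULE = re.compile(r"^[a-z]+\*?\d[a-z]*[>.]?$")


def build_rule_dictionary(rule_tuple):
    """Group rule strings by their first letter (hypothetical helper)."""
    rule_dictionary = {}
    for rule in rule_tuple:
        if not VALID_RULE.match(rule):
            raise ValueError(f"The rule {rule} is invalid")
        rule_dictionary.setdefault(rule[0], []).append(rule)
    return rule_dictionary


def candidate_rules(word, rule_dictionary):
    """Return the rules filed under the word's last letter, if any."""
    if not word:
        return []
    return rule_dictionary.get(word[-1], [])


sample_rules = ("ai*2.", "a*1.", "bb1.", "city3s.", "ci2>", "cn1t>", "dd1.", "dei3y>")
rules = build_rule_dictionary(sample_rules)

print(rules["d"])                         # ['dd1.', 'dei3y>']
print(candidate_rules("replied", rules))  # ['dd1.', 'dei3y>']
print(candidate_rules("walk", rules))     # []
```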

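The default_rule_tuple actually comes from whoosh's implementation of the Paice-Husk stemmer, a.k.a. the Lancaster stemmer https://github.com/nltk/nltk/pull/1661 =)

Q: In Porter, the word In stays as the capitalized In. Why?

This is super interesting! And most probably a bug.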

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'

If we look at the code, the first thing PorterStemmer.stem() does is lowercase the word, https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L651

def stem(self, word):
    stem = word.lower()

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]

    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word

    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)

    return stem

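Every other path through the code returns stem, which is lowercased, but there are two if clauses that return some form of the original word, which has not been lowercased!!!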

if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[word]

if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return word

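The first if clause checks whether the word is in self.pool, which holds the irregular words and their stems.

The second checks whether len(word) <= 2 and, if so, returns the original word unchanged, capitalization included.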
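
To see the effect of that early return in isolation, here is a small sketch that mimics only the control flow of the method quoted above. `toy_stem` is a hypothetical stand-in, not NLTK code, and it performs no real stemming steps. `safe_stem` shows one possible workaround, which is to lowercase the input yourself before handing it to the stemmer.

```python
def toy_stem(word, steps=()):
    """Mimic the control flow of the quoted stem() method (no real stemming)."""
    stem = word.lower()
    if len(word) <= 2:
        # The early return hands back the caller's original string,
        # so its capitalization survives.
        return word
    for step in steps:
        stem = step(stem)
    return stem


def safe_stem(word, stem_fn):
    """Workaround sketch: lowercase first, then stem."""
    return stem_fn(word.lower())


print(toy_stem("In"))             # In
print(toy_stem("Into"))           # into
print(safe_stem("In", toy_stem))  # in
```

With a real stemmer, the same idea would be to pass `word.lower()` to its `stem` method, assuming lowercase output is what you want for short words too.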

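Q: Notice that Lancaster removes the trailing `e` in words such as "came". Why?

Unsurprisingly, this also comes from default_rule_tuple https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L67, where there is a rule that changes -e > - =)

Q: How do I disable the `-e > -` rule in `default_rule_tuple`?

(Un-)fortunately, the LancasterStemmer._rule_tuple object is an immutable tuple, so we can't simply remove an item from it, but we can override it =)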

>>> from nltk.stem import LancasterStemmer
>>> lancaster = LancasterStemmer()
>>> lancaster.stem('came')
'cam'

# Create a new stemmer object to refresh the cache.
>>> lancaster = LancasterStemmer()
# Find the 'e1>' rule.
>>> lancaster._rule_tuple.index('e1>')
12

# Create a temporary rule list from the tuple.
>>> temp_rule_list = list(lancaster._rule_tuple)
# Remove the rule.
>>> temp_rule_list.pop(12)
'e1>'
# Override the `._rule_tuple` variable.
>>> lancaster._rule_tuple = tuple(temp_rule_list)

# Et voila!
>>> lancaster.stem('came')
'came'
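
The session above relies on the hard-coded index 12, which could shift if the default rules ever change. A slightly more defensive sketch is to filter the tuple by value instead. `without_rule` is a hypothetical helper, not part of NLTK, and the sample tuple is only a short excerpt for illustration.

```python
def without_rule(rule_tuple, unwanted):
    """Return a copy of rule_tuple with every occurrence of `unwanted` removed."""
    if unwanted not in rule_tuple:
        raise ValueError(f"The rule {unwanted} is not in the rule tuple")
    return tuple(rule for rule in rule_tuple if rule != unwanted)


# Short illustrative excerpt, not the full default rule set.
sample_rules = ("dd1.", "dei3y>", "e1>", "feil1v.")

print(without_rule(sample_rules, "e1>"))  # ('dd1.', 'dei3y>', 'feil1v.')
```

The result could then be assigned to `lancaster._rule_tuple` on a fresh stemmer object, in the same way as in the session above.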
