NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word? - Answers

【Question title】: NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?
【Posted】: 2019-08-28 17:56:06
【Question】:

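I am using the NLTK WordNet Lemmatizer for a part-of-speech tagging project: I first replace each word in the training corpus with its stem (modifying the corpus in place), and then train only on the new corpus. However, the lemmatizer does not behave the way I expected.

For example, the word loves is lemmatized to love, which is correct, but the word loving is still loving after lemmatization. Here loving is used as in the sentence "I'm loving it".

Isn't love the stem of the inflected word loving? Similarly, many other "ing" forms are left unchanged by lemmatization. Is this the correct behavior?

What other accurate lemmatizers are there (not necessarily in NLTK)? Are there morphological analyzers or lemmatizers that also take a word's part-of-speech tag into account when deciding the stem? For example, the word killing should have kill as its stem when it is used as a verb, but killing as its stem when it is used as a noun (as in "the killing was done by xyz").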

【Comments on the question】:

Tags: python nlp nltk

【Solution 1】:

The WordNet lemmatizer does take the POS tag into account, but it does not magically determine it for you:

>>> import nltk
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
'love'

Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you are passing it the noun "loving" (as in "sweet loving").

【Discussion】:

Thanks for the answer! Can you tell me which tags it accepts? n=noun, v=verb ...?
@AbhishekBhatia You can use WordNetCorpusReader.ADJ/ADJ_SAT/ADV/NOUN/VERB (with the values "a", "s", "r", "n", "v" respectively).
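
A quick way to see how much the POS argument matters is to run one word through each of the codes listed above. This is only a small sketch: the helper name `lemmas_by_pos` is hypothetical, and the lemmatizer is passed in as a plain callable, so the snippet itself does not need NLTK or its WordNet data.

```python
# One-letter WordNet POS codes, as quoted in the comment above:
# noun, verb, adjective, satellite adjective, adverb.
WORDNET_POS_CODES = ('n', 'v', 'a', 's', 'r')


def lemmas_by_pos(word, lemmatize):
    """Return {pos_code: lemma} for `word` under every WordNet POS code.

    `lemmatize` is any callable taking (word, pos) and returning a string.
    """
    return {pos: lemmatize(word, pos) for pos in WORDNET_POS_CODES}


# Example with NLTK (requires the WordNet corpus to be downloaded):
#   from nltk.stem import WordNetLemmatizer
#   print(lemmas_by_pos('loving', WordNetLemmatizer().lemmatize))
```

Comparing the entries of the returned dictionary shows at a glance which POS codes actually change the word.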

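【Solution 2】:

The best way to troubleshoot this is to actually look in WordNet. Take a look here: Loving in wordnet. As you can see, there is actually an adjective "loving" in WordNet. As a matter of fact, there is even an adverb "lovingly": lovingly in Wordnet. Because WordNet does not know which part of speech you actually want, it defaults to noun ('n' in WordNet). If you are using the Penn Treebank tag set, here are some handy functions for converting Penn tags to WN tags: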

from nltk.corpus import wordnet as wn

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']


def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']


def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']


def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']


def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return None

Hope this helps.

【Discussion】:

wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['n', 'r', 'v'] else 'x'
One line is a bit better than 28 ;)
However, it should be wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['n', 'r', 'v'] else 'n', because the function's default is NOUN, not 'x' or None.
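
Note that both one-liners test membership in ['n', 'r', 'v'] before the 'j' check can run, so a tag such as JJ never reaches the 'a' branch and falls through to the default. Below is a minimal sketch of a version that does handle adjectives, assuming the one-letter WordNet codes quoted earlier in the thread. The name `wnpos` is reused from the comment above, and the empty-tag guard is an added assumption.

```python
def wnpos(penn_tag):
    """Map a Penn Treebank tag to a one-letter WordNet POS code."""
    # Slicing instead of indexing tolerates an empty tag
    # (an assumption; the one-liner above indexes e[0] directly).
    first = penn_tag[:1].lower()
    if first == 'j':                # JJ, JJR, JJS -> adjective
        return 'a'
    if first in ('n', 'r', 'v'):    # noun, adverb, verb keep their first letter
        return first
    return 'n'                      # default to noun, as lemmatize() does


print([wnpos(t) for t in ['JJ', 'VBG', 'RB', 'NNS', 'DT']])
# prints ['a', 'v', 'r', 'n', 'n']
```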

【Solution 3】:

Clearer and more efficient than enumerating every tag:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def penn_to_wn(tag):
    return get_wordnet_pos(tag)
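
To tie this mapping back to the killing example in the question, here is a rough sketch of lemmatizing a whole POS-tagged sentence. The helper names `to_wordnet_pos` and `lemmatize_tagged` are hypothetical, and `toy_lemmatize` is only a stand-in so the snippet runs without NLTK data. The noun fallback for unmapped tags is an assumption that mirrors the default of lemmatize().

```python
def to_wordnet_pos(treebank_tag):
    """Return 'a', 'v', 'n' or 'r' for a Penn Treebank tag, else None."""
    prefix_to_pos = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return prefix_to_pos.get(treebank_tag[:1])


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize a list of (word, penn_tag) pairs.

    `lemmatize` is any callable taking (word, pos), for example
    nltk.stem.WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = to_wordnet_pos(tag) or 'n'   # fall back to noun for unmapped tags
        lemmas.append(lemmatize(word, pos))
    return lemmas


def toy_lemmatize(word, pos):
    """Toy stand-in for a real lemmatizer: strips 'ing' from verbs only."""
    return word[:-3] if pos == 'v' and word.endswith('ing') else word


print(lemmatize_tagged(
    [('They', 'PRP'), ('are', 'VBP'), ('killing', 'VBG'), ('time', 'NN')],
    toy_lemmatize))
# prints ['They', 'are', 'kill', 'time']
print(lemmatize_tagged([('the', 'DT'), ('killing', 'NN')], toy_lemmatize))
# prints ['the', 'killing']
```

With NLTK and its tagger and WordNet data installed, the same function could be called as lemmatize_tagged(nltk.pos_tag(nltk.word_tokenize(text)), nltk.stem.WordNetLemmatizer().lemmatize), so the tag assigned to each token decides whether killing is treated as a verb or a noun.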

【Discussion】:

【Solution 4】:

As an extension to the accepted answer by @Fred Foo above:

from nltk import WordNetLemmatizer, pos_tag, word_tokenize
from nltk.corpus import wordnet  # needed for the adverb branch below

lem = WordNetLemmatizer()
word = input("Enter word:\t")

# Get the single character pos constant from pos_tag like this:
pos_label = (pos_tag(word_tokenize(word))[0][1][0]).lower()

# pos_refs = {'n': ['NN', 'NNS', 'NNP', 'NNPS'],
#            'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
#            'r': ['RB', 'RBR', 'RBS'],
#            'a': ['JJ', 'JJR', 'JJS']}

if pos_label == 'j': pos_label = 'a'    # 'j' <--> 'a' reassignment

if pos_label in ['r']:  # For adverbs it's a bit different
    print(wordnet.synset(word+'.r.1').lemmas()[0].pertainyms()[0].name())
elif pos_label in ['a', 's', 'v']: # For adjectives and verbs
    print(lem.lemmatize(word, pos=pos_label))
else:   # For nouns and everything else as it is the default kwarg
    print(lem.lemmatize(word))

【Discussion】: