Lemmatization in Pandas (Python): Answers

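【Question title】: Lemmatization Pandas (Python)
【Posted】: 2018-07-10 13:56:00
【Question】:

I'm a beginner with Pandas and I'm trying to figure out how to lemmatize a single column of my data frame. Take the following example (this is some text after (un)common-word removal, which I want to lemmatize):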

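0 good needs changes virgils natural microbrewed...

1 new favorite pleasant surprise discovered fl...

2 red sauce favorite enjoyed rich tannins ok la...

3 great quality 1800s 21st century trying beverage...

4 red first time trying love 100 excellent blend...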

This is the code I'm using for lemmatization (taken from here):

from textblob import Word  # assumed: Word is TextBlob's Word class
df['words'] = df['words'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x]))
df['words'].head()

But when I run this code, the output doesn't change:

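0 good need change virgil natural microbrewed r...

1 new favorite pleasant surprise discovered fl...

2 red sauce favorite enjoyed rich tannins ok la...

3 great quality 1800s 21st century trying beverage...

4 red first time trying love 100 excellent blend...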

Any help would be greatly appreciated :)

P.S.: words is a list of tokens

【Question comments】:

It looks like needs => need, changes => change, and virgils => virgil, so the output did change.
@Scratch'N'Purr Oh, right... I was focusing more on: shouldn't trying become try? Or brewed become brew?
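Fair point. In that case, your problem may be more complicated than it looks, because you have to specify the part of speech (POS) for the verbs you just mentioned. If you call the lemmatize method without a POS, it treats the word as a noun by default, so verb forms come back unchanged. For trying and brewed, the code therefore has to be Word('trying').lemmatize('v') and Word('brewed').lemmatize('v'). Source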

Tags: python pandas lemmatization

【Solution 1】:

You may no longer need a solution, but if you want to lemmatize across several parts of speech, you can try the code below:

import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

# Requires NLTK's tokenizer, POS tagger, and WordNet data (see nltk.download)
lemmatizer = WordNetLemmatizer()


def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)



# Lemmatize every sentence in the 'word' column (each cell is a plain string)
df['Lemmatize'] = df['word'].apply(lemmatize_sentence)
print(df.head())

Resulting df:

         word                       |        Lemmatize

0  Best scores, good cats, it rocks | Best score , good cat , it rock

1          You received best scores |          You receive best score

2                         Good news |                       Good news

3                          Bad news |                        Bad news

4                    I am loving it |                    I be love it

5                    it rocks a lot |                   it rock a lot

6     it is still good to do better |     it be still good to do good
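
Note that lemmatize_sentence above expects each cell to be a plain string, whereas the question's P.S. says the words column holds lists of tokens. A minimal sketch of a token-list variant might look like the following. The names penn_to_wordnet and lemmatize_tokens are hypothetical, and the tagger and lemmatizer are passed in as arguments so that the logic can be read and run without any NLTK data installed; this is an illustration under those assumptions, not the answerer's code.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# WordNet POS codes written as plain strings; these match the values of
# nltk.corpus.wordnet.ADJ, VERB, NOUN and ADV.
_PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag: str) -> Optional[str]:
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code, or None."""
    return _PENN_PREFIX_TO_WORDNET.get(tag[:1])


def lemmatize_tokens(
    tokens: Iterable[str],
    pos_tag: Callable[[List[str]], List[Tuple[str, str]]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Lemmatize an already-tokenized list, using each token's POS tag.

    pos_tag:   takes a list of tokens, returns (token, tag) pairs
    lemmatize: takes (word, wordnet_pos), returns the lemma
    """
    result = []
    for word, tag in pos_tag(list(tokens)):
        wn_pos = penn_to_wordnet(tag)
        # Tokens without a usable tag are kept unchanged, as in the answer above
        result.append(word if wn_pos is None else lemmatize(word, wn_pos))
    return result


# Possible wiring with NLTK and pandas (assumes the tagger and WordNet data
# are installed and that df['words'] holds lists of tokens):
#   import nltk
#   from nltk.stem import WordNetLemmatizer
#   wnl = WordNetLemmatizer()
#   df['words'] = df['words'].apply(
#       lambda toks: lemmatize_tokens(toks, nltk.pos_tag, wnl.lemmatize))
```

Because the tokens are tagged as a list, no call to word_tokenize is needed. Keep in mind that POS tagging text whose common words have already been removed may be less reliable than tagging full sentences.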

【Comments】: