Stemming on tokenized words: answers

[Question title]: Stemming on tokenized words
[Posted]: 2021-01-06 17:30:36
[Question description]:

I have this dataset:

>cleaned['text']
0         [we, have, a, month, open, #postdoc, position,...
1         [the, hardworking, biofuel, producers, in, iow...
2         [the, hardworking, biofuel, producers, in, iow...
3         [in, today, s, time, it, is, imperative, to, r...
4         [special, thanks, to, gaetanos, beach, club, o...
                                ...                        
130736    [demand, gw, sources, fossil, fuels, renewable...
130737         [there, s, just, not, enough, to, go, round]
130738    [the, answer, to, deforestation, lies, in, space]
130739    [d, filament, from, plastic, waste, regrind, o...
130740          [gb, grid, is, generating, gw, out, of, gw]
Name: text, Length: 130741, dtype: object

Is there an easy way to stem all the words?

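[Comments]:

Do you mean taking "demand" and deriving its English variants, such as demanding, demanded, demands, and so on?
For example, the dataset contains the words "car" and "cars". I want them to be treated as the same word.
What you're asking for is difficult. English has no regular forms that make this easy. Check out some of the related links on this question for several approximations.
Actually, this is very easy to do in R, so I assume there is a way to do it here too. Put simply, I guess such an algorithm would trim words that have many letters in common.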
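
The "trimming" idea from the last comment can be sketched as a toy suffix stripper. This is only an illustration: the suffix list and the names `naive_stem` and `stem_tokens` are made up for this example, and a real stemmer (for instance the Porter algorithm) applies many more rules.

```python
# Toy suffix-stripping stemmer: an illustration only, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def stem_tokens(tokens):
    """Stem every token in one tokenized row."""
    return [naive_stem(token) for token in tokens]


# Each row is a list of tokens, like the rows of cleaned['text'] in the question.
rows = [["car", "cars"], ["demand", "demands", "demanded", "demanding"]]
stemmed = [stem_tokens(row) for row in rows]
print(stemmed)
```

On a pandas column of token lists, the same function could be passed to `Series.apply`, e.g. `cleaned['text'].apply(stem_tokens)`.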

Tags: python nlp stemming

[Solution 1]:

You may find better answers, but I personally find the LemmInflect library to be the best choice for lemmatization and inflection.

# pip install lemminflect
from lemminflect import getAllLemmas, getInflection

word = 'testing'
# getAllLemmas returns a dict keyed by part of speech; take the first tuple of lemmas
lemma = list(getAllLemmas(word, upos='NOUN').values())[0]
# Inflect the lemma to the past tense (Penn Treebank tag VBD)
inflect = getInflection(lemma[0], tag='VBD')

print(word, lemma, inflect)

testing ('test',) ('tested',)

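I would avoid stemming, because it is not very useful if you want to use a language model or do text classification that relies on any context. Both stemming and lemmatization reduce inflected words to a root form. The difference is that a stem may not be an actual word, whereas a lemma is an actual word in the language.

Inflection is the opposite of lemmatization.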
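
To make the stem/lemma distinction concrete, here is a small self-contained sketch. The mini-lexicon, the inflection table, and the function names are all hypothetical stand-ins, not LemmInflect's API; a real lemmatizer ships a far larger dictionary.

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemma.
LEMMAS = {"studies": "study", "studied": "study", "was": "be", "am": "be"}
# Hypothetical inflection table: (lemma, Penn Treebank tag) -> inflected form.
INFLECTIONS = {("study", "VBD"): "studied", ("study", "VBZ"): "studies"}


def crude_stem(word):
    """Chop a common suffix; the result need not be a real word."""
    for suffix in ("es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


def inflect(lemma, tag):
    """Go the other way: from a lemma and a tag to an inflected form."""
    return INFLECTIONS.get((lemma, tag), lemma)


for word in ["studies", "studied", "was"]:
    print(word, "stem:", crude_stem(word), "lemma:", lemmatize(word))

# Inflecting a lemma and lemmatizing the result gets back to the lemma.
print(lemmatize(inflect("study", "VBD")))
```

The stemmer turns "studies" into "studi", which is not a word, and cannot relate "was" to "be" at all; the dictionary lookup handles both.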

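sentence = ['I', 'am', 'testing', 'my', 'new', 'library']

def lemmatize(tokens):
    """Return the first lemma found for each token, or the token itself if none."""
    lemmatized_sent = []
    for token in tokens:
        try:
            lemmatized_sent.append(list(getAllLemmas(token, upos='NOUN').values())[0][0])
        except Exception:
            # No lemma found for this token: keep it unchanged
            lemmatized_sent.append(token)
    return lemmatized_sent

lemmatize(sentence)

['I', 'be', 'test', 'my', 'new', 'library']

# To apply this to a DataFrame column of token lists:
df['sentences'].apply(lemmatize)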
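
With around 130,000 rows, as in the question, the same tokens come up again and again, so one option is to memoize the per-token lookup. The sketch below assumes a stand-in lexicon (`_LEXICON`) in place of the real lemmatizer call, and the names `cached_lemma` and `lemmatize_row` are hypothetical.

```python
from functools import lru_cache

# Stand-in for an expensive lemmatizer call (hypothetical mini-lexicon).
_LEXICON = {"cars": "car", "producers": "producer", "fuels": "fuel"}
calls = 0  # counts real lookups, not cache hits


@lru_cache(maxsize=None)
def cached_lemma(token):
    """Look up one token; repeated tokens are served from the cache."""
    global calls
    calls += 1
    return _LEXICON.get(token, token)


def lemmatize_row(tokens):
    """Lemmatize one tokenized row using the cached lookup."""
    return [cached_lemma(token) for token in tokens]


rows = [["the", "cars", "fossil", "fuels"], ["the", "producers", "cars"]]
result = [lemmatize_row(row) for row in rows]
print(result)
print("lookups performed:", calls)
```

Seven tokens are processed here but only five lookups run, one per distinct token. In real use the body of `cached_lemma` would wrap the actual lemmatizer call.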

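Please read the documentation for LemmInflect. You can do a lot more with it.

[Discussion]:

Thank you so much! Do you know how to apply it to an entire column of lists?
I've updated my answer with an example sentence. You can create a function and apply it to a pandas column.
If it solved your problem, please mark the answer as accepted. Thanks, and good luck.
Thanks again. I can't edit the answer, but I think you should add getAllLemmas to the imports.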