How to remove stop words and lemmatize at the same time when using spaCy?

【Question title】: How to remove stop words and lemmatize at the same time when using spaCy?
【Posted】: 2021-08-14 12:03:31
【Question】:

When I clean my data with spaCy, I run the following line:

df['text'] = df.sentence.progress_apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha))

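This lemmatizes each word in a text row if the word is not a stop word. The problem is that token.lemma_ is applied only after the token has been checked against the stop word list. So if a stop word does not appear in its lemmatized form, it is not treated as a stop word. For example, if I add "friend" to the stop word list, the output still contains "friend" when the original token is "friends". The simple fix is to run this line twice, but that seems silly. Can anyone suggest a way to remove, in the first pass, stop words that do not appear in their lemmatized form?

Thanks!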

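【Comments on the question】:

Why not simply lemmatize first and then remove the stop words?
Note that you generally should not do this kind of preprocessing for modern NLP models. You should just use the raw natural text.

Tags: python nlp spacy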

【Solution 1】:

You can simply check whether token.lemma_ is present in nlp.Defaults.stop_words:

if token.lemma_.lower() not in nlp.Defaults.stop_words

For example:

df['text'] = df.sentence.progress_apply(
    lambda text: 
        " ".join(
            token.lemma_ for token in nlp(text)
                if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha
        )
)

A quick test:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")

>>> nlp.Defaults.stop_words.add("friend") # Adding "friend" to stopword list

>>> text = "I have a lot of friends"
>>> " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha)
'lot friend'

>>> " ".join(token.lemma_ for token in nlp(text) if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha)
'lot'
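
The difference between the two checks in the quick test can be sketched without loading a model. This is only an illustration: StubToken is a hypothetical stand-in for a spaCy token that carries just the attributes used above (lemma_, is_stop, is_alpha), and the lemmas and stop flags are filled in by hand to mirror the test sentence rather than produced by a real pipeline.

```python
from dataclasses import dataclass


@dataclass
class StubToken:
    """Hypothetical stand-in for a spaCy token (not part of spaCy)."""
    text: str
    lemma_: str
    is_stop: bool
    is_alpha: bool = True


def filter_on_surface(tokens):
    # Original approach: the stop word flag reflects the token as written,
    # so "friends" passes even though its lemma "friend" is a stop word.
    return " ".join(t.lemma_ for t in tokens if not t.is_stop and t.is_alpha)


def filter_on_lemma(tokens, stop_words):
    # Approach from the answer: compare the lowercased lemma with the list.
    return " ".join(
        t.lemma_ for t in tokens
        if t.lemma_.lower() not in stop_words and t.is_alpha
    )


# Hand-written values mirroring "I have a lot of friends" (assumed, not model output).
stop_words = {"i", "have", "a", "of", "friend"}
tokens = [
    StubToken("I", "I", True),
    StubToken("have", "have", True),
    StubToken("a", "a", True),
    StubToken("lot", "lot", False),
    StubToken("of", "of", True),
    StubToken("friends", "friend", False),  # surface form is not in the list
]

print(filter_on_surface(tokens))            # lot friend
print(filter_on_lemma(tokens, stop_words))  # lot
```

With a real pipeline, nlp(text) and nlp.Defaults.stop_words would take the place of the hand-built tokens and stop_words.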

If you add uppercase words to the stop word list, you will need to use if token.lemma_.lower() not in map(str.lower, nlp.Defaults.stop_words).
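
One possible refinement, offered as a sketch rather than as part of the original answer: the map(...) expression is rebuilt and scanned linearly for every token, so lowercasing the stop word list once into a set keeps each lookup cheap. The helper names normalize_stop_words and is_stop_lemma and the sample list are hypothetical; in practice the input would be nlp.Defaults.stop_words.

```python
def normalize_stop_words(stop_words):
    """Lowercase every entry once so later membership checks ignore case."""
    return frozenset(word.lower() for word in stop_words)


def is_stop_lemma(lemma, normalized_stop_words):
    """Return True if the lowercased lemma is in the normalized stop word set."""
    return lemma.lower() in normalized_stop_words


# Hypothetical stop word list containing a capitalized entry.
custom_stop_words = {"the", "of", "Friend"}
normalized = normalize_stop_words(custom_stop_words)

lemmas = ["lot", "friend", "Friend", "of"]
kept = [lemma for lemma in lemmas if not is_stop_lemma(lemma, normalized)]
print(kept)  # ['lot']
```

The set is built once, outside the per-token loop, and then reused for every row of the DataFrame.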

【Comments】: