Is there a minimum term length required for features in Sklearn TfidfVectorizer

【Question title】: Is there a minimum term length required for features in Sklearn TfidfVectorizer
【Posted】: 2020-02-15 22:37:52
【Question】:

I have a pandas dataframe containing sentences that I am trying to compute Tfidf for:

df['sentence'] = ['buy donuts', 'buy donuts', 'buy donuts', 'buy donuts', 'buy donuts', 'buy donuts', 'buy donuts', 'buy donuts', 'buy donuts', 'buy donuts', 'purchase donuts', 'purchase donuts', 'purchase donuts', 'purchase donuts', 'purchase donuts', 'buy donut', 'buy a donut', 'buy 2 donuts', 'buy 2 donuts', 'buy 2 donuts', 'buy 12 donuts', 'buy 12 donuts', 'buy 12 donuts', 'purchase 2 donuts', 'purchase 12 donuts', 'i want to buy 2 donuts', 'i want to buy 12 donuts', 'i want to buy donuts', 'i want to buy some donuts', 'buy some donuts', 'buy two donuts', 'buy two donuts', 'buy two donuts', 'buy twelve donuts', 'buy twelve donuts', 'buy twelve donuts', 'purchase two donuts', 'purchase twelve donuts', 'i want to buy two donuts', 'i want to buy twelve donuts']

I first lemmatize these sentences (code below) and then feed the lemmatized list to sklearn's TfidfVectorizer.

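However, I noticed an odd anomaly: some terms are not included as features, even though min_df and max_df are left at their defaults, which should include all terms. When I run get_feature_names(), every term is listed as a feature except "i", "a", and "2":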

['12', 'buy', 'donut', 'purchase', 'some', 'to', 'twelve', 'two', 'want']

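I am not removing stop words. For my purposes, "2" is highly discriminative. Is there a minimum term length for features in TfidfVectorizer? How can I get these terms included as features?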

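import numpy as np
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: spaCy 3+ no longer accepts the "en" shortcut; use a full model name such as "en_core_web_sm"
nlp = spacy.load("en", disable=['ner'])
vect = TfidfVectorizer(binary=True)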

## Load in data
df = pd.read_csv('buy donuts.csv', encoding='utf-8')
df.columns = df.columns.str.lower()

## Normalize sentences
df['sentence'] = df['sentence'].str.replace(r"[^\w\s']", '', regex=True).str.lower().str.strip().replace('', np.nan)

df = df.dropna(subset=['unit name', 'sentence'])

## Get lemmas for tfidf
def lemmas(x):
    docs = nlp(x)
    sents_lemma = [token.lemma_ for token in docs]
    return ' '.join(sents_lemma)

df['lemmas'] = df['sentence'].apply(lemmas)

## Get tfidf and calculate scores
tfidf = vect.fit_transform(df.lemmas.values.tolist())
scores = (tfidf * tfidf.T).toarray().mean(axis=0)

print(vect.get_feature_names())  # on scikit-learn >= 1.0, use get_feature_names_out()

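【Comments】:

This is an interesting question. In my case I would want to exclude single-letter tokens because they would be noise, but it really depends on the corpus.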

Tags: python python-3.x pandas sklearn-pandas tfidfvectorizer

【Solution 1】:

Check the regular expression your TfidfVectorizer is using, which you did not set explicitly. It can be accessed (or changed) through the token_pattern parameter.

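You can see that the default, (?u)\b\w\w+\b, requires at least two word characters per token, which prevents single-character tokens such as 2, i, or a from being recognized.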
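The effect of that pattern can be checked without scikit-learn by running the same regular expression through Python's re module. This is only a sketch: the sample sentence is made up for illustration and is not from the question's data.

```python
import re

# The default token_pattern quoted above: a token needs 2+ word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Made-up sample text (an assumption, not taken from the asker's CSV).
sample = "i want to buy a donut and 2 more, or 12"

default_tokens = re.findall(DEFAULT_PATTERN, sample)
print(default_tokens)
# Single-character tokens ("i", "a", "2") are missing from the result,
# while the two-character tokens "to", "or" and "12" are kept.
```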

I'm not sure which regular expression is best for your use case; you may want to experiment. However, the following will capture the buy 2 donut case:

(?u)\b\w+\b
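As a rough sketch of how that pattern might be passed in, the snippet below fits two vectorizers on three sentences in the style of the question's data (the short list is an assumption for illustration) and compares their vocabularies. The getattr fallback is only there to cope with older scikit-learn releases that lack get_feature_names_out().

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few sentences in the style of the question's data (illustrative subset).
docs = ["buy a donut", "buy 2 donuts", "i want to buy 12 donuts"]

def feature_names(vectorizer):
    """Return the fitted vocabulary as a list of str, across sklearn versions."""
    getter = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
    return [str(name) for name in getter()]

# Default pattern: tokens must have at least two word characters.
default_vect = TfidfVectorizer(binary=True)
default_vect.fit(docs)
default_features = feature_names(default_vect)

# Relaxed pattern from the answer: one or more word characters.
relaxed_vect = TfidfVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
relaxed_vect.fit(docs)
features = feature_names(relaxed_vect)

print(default_features)
print(features)
```

With the relaxed pattern, "2", "a", and "i" should appear alongside the longer terms.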

So, to answer the general question: I believe you can craft your token_pattern to enforce a minimum length (or even a maximum length).
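One possible way to build such a pattern is with a regex quantifier like \w{min,max}. The helper below, length_bounded_pattern, is a hypothetical name introduced here for illustration; it is not part of scikit-learn, and the resulting string would be passed as token_pattern.

```python
import re

def length_bounded_pattern(min_len, max_len=None):
    """Hypothetical helper: build a token_pattern string that matches
    tokens of at least min_len and (optionally) at most max_len word chars."""
    upper = "" if max_len is None else str(max_len)
    return rf"(?u)\b\w{{{min_len},{upper}}}\b"

sentence = "i want to buy 12 donuts"

# Minimum length 3, no maximum.
at_least_three = re.findall(length_bounded_pattern(3), sentence)

# Between 1 and 3 characters.
one_to_three = re.findall(length_bounded_pattern(1, 3), sentence)

print(length_bounded_pattern(3))     # (?u)\b\w{3,}\b
print(at_least_three)
print(one_to_three)
```

Because of the \b anchors, a token longer than the maximum is skipped entirely rather than truncated.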

【Comments】: