【Question Title】: Creating a custom CountVectorizer with Scikit-Learn
【Posted】: 2022-06-11 23:15:32
【Question】:

I want to create a custom CountVectorizer with Python and the Scikit-Learn library. I have written code that uses the TextBlob library to extract phrases from a Pandas DataFrame, and I want my vectorizer to count those phrases.

My code:

from textblob import TextBlob
import pandas as pd

my_list = ["I want to buy a nice bike for my girl. She broke her old bike last year.", 
        "I had a great time watching that movie last night. We should do the same next week", 
        "Where can I buy some tasty apples and oranges? I want to eat healthy food", 
        "The songs from this band are boring, let's play some other music from some good bands", 
        "If you buy this now, you will get 3 different products for free in the next 10 days.", 
        "I am living in a small house in France, and my wish is to learn how to ski and snowboard",
        "It is time to invest in some tech stock. The stock market will become very hot in the next few months",
        "This player won all 4 grand slam tournaments last year. He is the best player in the world!"]

df = pd.DataFrame({"TEXT": my_list})

final_list = []
for text in df.TEXT:
    
    blob = TextBlob(text)
    result_list = blob.noun_phrases
    print(result_list)
    final_list.extend(result_list)

print(final_list)

I know that when using Scikit-Learn, a CountVectorizer can be created like this:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

features = df.iloc[:, :-1]
results = df.iloc[:, -1]

# vectorizer
transformerVectoriser = ColumnTransformer(transformers=[
    ('vector title',
     CountVectorizer(analyzer='word', ngram_range=(2, 4),
                     max_features=1000, stop_words='english'),
     'TEXT')])

clf = RandomForestClassifier(max_depth = 75, n_estimators = 125, random_state = 42)
pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', clf)])


cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring = 'accuracy')

But how can I create a vectorizer from the phrases extracted earlier? For example, the phrases extracted from the texts in my_list are:

['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']

How can I create a custom CountVectorizer whose features are the phrases listed above?

【Question Comments】:

    Tags: python machine-learning scikit-learn


    【Solution 1】:

    If you initialize CountVectorizer(vocabulary=noun_phrases, ...), you should get the desired effect:

    from sklearn.feature_extraction.text import CountVectorizer

    noun_phrases = ['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
    
    cv = CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4))
    res = cv.transform(my_list)
    res.todense()
    
    >>>
    matrix([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
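    Note that the single-word entry 'france' (column 9) is never counted in the matrix above, because ngram_range=(2, 4) only generates 2- to 4-word windows. A minimal sketch of the fix, widening the range to include unigrams (using a two-entry vocabulary for brevity):

```python
from sklearn.feature_extraction.text import CountVectorizer

# two entries from the vocabulary above: a bigram and a unigram
vocab = ['small house', 'france']

# ngram_range=(1, 4) also generates unigrams, so 'france' can match
cv = CountVectorizer(vocabulary=vocab, ngram_range=(1, 4))
res = cv.transform(["I am living in a small house in France"])
print(res.toarray())  # [[1 1]] -- both vocabulary entries are found
```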
    

    【Discussion】:

    • If I pass a vocabulary, why would I also set ngram_range? If my vocabulary has, say, 6000 phrases of 1 to 6 words each, why should I add ngram_range?
    • Because CountVectorizer does some text processing before looking terms up in the vocabulary. It first removes stop words (if configured), then builds sliding windows of the requested lengths, and only then looks the windows up in the vocabulary. So if your noun-phrase vocabulary contains entries of 1 to 6 words, you must set ngram_range to (1, 6). You can see that in the sentence containing "france" that feature was not counted, precisely because it is a single word and the CountVectorizer was set to look only for bigrams through 4-grams.
    • One more question: if my phrase is "red apple" but the text contains "red apples", should I change the analyzer to "char"?
    【Solution 2】:

    You can customize the tokenizer function of sklearn's CountVectorizer:

    from textblob import TextBlob

    def noun_phrases_tokenizer(text):
        return TextBlob(text).noun_phrases
        
    count_vectorizer = CountVectorizer(tokenizer=noun_phrases_tokenizer)
    transformerVectoriser = ColumnTransformer(transformers=[('count', count_vectorizer, 'TEXT')])
    transformerVectoriser.fit_transform(df)
    
    print(transformerVectoriser.transformers_[0][1].get_feature_names_out())
    # ['different products', 'good bands', 'grand slam tournaments', ...]
    

    Update: adding lemmatization

    import textblob
    
    def lemmatize_noun_phrase(phrase):
        # phrase.lemmatize() not working correctly
        return " ".join([textblob.Word(w).lemmatize() for w in phrase.split(" ")])
    
    def custom_tokenizer(text):
        phrases = textblob.TextBlob(text).noun_phrases
        return [lemmatize_noun_phrase(p) for p in phrases]
    
    print(custom_tokenizer("I love green apples"))  # ["green apple"]
    count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
    

    【Discussion】:

    • If my phrase is "red apple" and the text contains "red apples", the tokenizer won't recognize it, right? Should I add analyzer='word' or analyzer='char'?
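    Regarding that plural question: switching to analyzer='char' would match raw character n-grams and lose the phrase features entirely; the usual fix is to normalize both the vocabulary entries and the extracted phrases, as the lemmatization update above does. A minimal sketch, with a naive singularizer standing in for textblob's Word(word).lemmatize() (the function names here are illustrative):

```python
def naive_singularize(word):
    # crude stand-in for textblob's Word(word).lemmatize();
    # just strips a trailing "s" for illustration
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize_phrase(phrase):
    return " ".join(naive_singularize(w) for w in phrase.split())

# apply the same normalization to vocabulary entries and extracted phrases,
# so "red apples" in the text matches "red apple" in the vocabulary
print(normalize_phrase("red apples"))  # red apple
```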