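[Published]: 2022-06-11 23:15:32
[Question]:
I want to create a custom CountVectorizer with Python and the Scikit-Learn library. I wrote code that uses the TextBlob library to extract phrases from a Pandas DataFrame, and I want my vectorizer to count those phrases.
My code: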
from textblob import TextBlob
import pandas as pd
my_list = ["I want to buy a nice bike for my girl. She broke her old bike last year.",
"I had a great time watching that movie last night. We shouuld do the same next week",
"Where can I buy some tasty apples and oranges? I want to head healthy food",
"The songs from this bend are boring, lets play some other music from some good bands",
"If you buy this now, you will get 3 different products for free in the next 10 days.",
"I am living in a small house in France, and my wish is to learn how to ski and snowboad",
"It is time to invest in some tech stock. The stock market is will become very hot in the next few months",
"This player won all 4 grand slam tournaments last year. He is the best player in the world!"]
df = pd.DataFrame({"TEXT": my_list})
final_list = []
for text in df.TEXT:
    blob = TextBlob(text)
    result_list = blob.noun_phrases
    print(result_list)
    final_list.extend(result_list)
print(final_list)
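I know that with Scikit-Learn you can create a CountVectorizer like this:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# The last column of df is taken as the label, everything before it as features
features = df.iloc[:, :-1]
results = df.iloc[:, -1]
# vectorizer
transformerVectoriser = ColumnTransformer(transformers=[
    ('vector title', CountVectorizer(analyzer='word', ngram_range=(2, 4), max_features=1000, stop_words='english'), 'TEXT')])
clf = RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42)
pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', clf)])
cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring='accuracy')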
But how do I create a vectorizer from the phrases I extracted earlier?
For example, the phrases extracted from the texts in my_list are:
['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
How can I create a custom count vectorizer whose features are the phrases listed above?
[Discussion]:
Tags: python machine-learning scikit-learn