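[Posted]: 2017-10-27 16:13:52
[Question]:
I am trying to build a classifier to categorize websites. This is my first time doing this, so it is all quite new to me. Currently I am trying to build a bag of words on several parts of a web page (e.g. title, text, headings). It looks like this: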
# Bag-of-words counts: one vectorizer per page part
from sklearn.feature_extraction.text import CountVectorizer
countvect_text = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_title = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_headings = CountVectorizer(encoding="cp1252", stop_words="english")
X_tr_text_counts = countvect_text.fit_transform(tr_data['text'])
X_tr_title_counts = countvect_title.fit_transform(tr_data['title'])
X_tr_headings_counts = countvect_headings.fit_transform(tr_data['headings'])
# Convert the raw counts to tf-idf weights
from sklearn.feature_extraction.text import TfidfTransformer
transformer_text = TfidfTransformer(use_idf=True)
transformer_title = TfidfTransformer(use_idf=True)
transformer_headings = TfidfTransformer(use_idf=True)
X_tr_text_tfidf = transformer_text.fit_transform(X_tr_text_counts)
X_tr_title_tfidf = transformer_title.fit_transform(X_tr_title_counts)
X_tr_headings_tfidf = transformer_headings.fit_transform(X_tr_headings_counts)
# Train one Naive Bayes classifier per page part
from sklearn.naive_bayes import MultinomialNB
text_nb = MultinomialNB().fit(X_tr_text_tfidf, tr_data['class'])
title_nb = MultinomialNB().fit(X_tr_title_tfidf, tr_data['class'])
headings_nb = MultinomialNB().fit(X_tr_headings_tfidf, tr_data['class'])
# Apply the already-fitted vectorizers and transformers to the test set
X_te_text_counts = countvect_text.transform(te_data['text'])
X_te_title_counts = countvect_title.transform(te_data['title'])
X_te_headings_counts = countvect_headings.transform(te_data['headings'])
X_te_text_tfidf = transformer_text.transform(X_te_text_counts)
X_te_title_tfidf = transformer_title.transform(X_te_title_counts)
X_te_headings_tfidf = transformer_headings.transform(X_te_headings_counts)
# Accuracy of each classifier on the test set
accuracy_text = text_nb.score(X_te_text_tfidf, te_data['class'])
accuracy_title = title_nb.score(X_te_title_tfidf, te_data['class'])
accuracy_headings = headings_nb.score(X_te_headings_tfidf, te_data['class'])
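This works fine and I get the accuracy I expected. However, as you may have guessed, it looks messy and is full of repetition. So my question is: is there a way to write this more concisely?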
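To make the repetition concrete, here is a rough sketch of what a more compact version might look like: one small pipeline per page part, built in a loop. It assumes `TfidfVectorizer` is an acceptable stand-in for `CountVectorizer` followed by `TfidfTransformer`. The tiny `tr_data` and `te_data` frames are made-up placeholders so the snippet runs on its own, and the `fields`, `models` and `accuracy` names are illustrative, not part of the code above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real tr_data / te_data.
tr_data = pd.DataFrame({
    "text": ["buy cheap shoes online", "latest football scores today",
             "discount shoes sale", "football match report"],
    "title": ["shoe shop", "sports news", "shoe outlet", "sports daily"],
    "headings": ["shoes sale", "football results",
                 "shoes discount", "football match"],
    "class": ["shop", "sport", "shop", "sport"],
})
te_data = pd.DataFrame({
    "text": ["cheap shoes sale", "football scores report"],
    "title": ["shoe shop", "sports news"],
    "headings": ["shoes discount", "football results"],
    "class": ["shop", "sport"],
})

fields = ["text", "title", "headings"]
models = {}
accuracy = {}
for field in fields:
    # TfidfVectorizer combines the counting and tf-idf steps in one object.
    model = make_pipeline(
        TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True),
        MultinomialNB(),
    )
    model.fit(tr_data[field], tr_data["class"])
    models[field] = model
    accuracy[field] = model.score(te_data[field], te_data["class"])

print(accuracy)
```

Each entry in `models` still corresponds to one of the three separate classifiers, so this only removes the duplication; it does not combine the features.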
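In addition, I am not sure how to combine these three features into a single multinomial classifier. I tried passing a list of tfidf values to MultinomialNB().fit(), but apparently that is not allowed.
Optionally, it would also be nice to add weights to the features, so that in the final classifier some vectors carry more importance than others.
I am guessing I need a pipeline, but I am not at all sure how it should be used in this case.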
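For reference, the kind of thing being asked for might look roughly like the sketch below: a single `Pipeline` whose first step is a `ColumnTransformer` that vectorizes each column separately and concatenates the results, followed by one `MultinomialNB`. This is only one possible arrangement, and it assumes a scikit-learn version that ships `sklearn.compose.ColumnTransformer`. The toy data frames and the values in `transformer_weights` are made up for illustration; real weights would need to be tuned, for example by cross-validation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real tr_data / te_data.
tr_data = pd.DataFrame({
    "text": ["buy cheap shoes online", "latest football scores today",
             "discount shoes sale", "football match report"],
    "title": ["shoe shop", "sports news", "shoe outlet", "sports daily"],
    "headings": ["shoes sale", "football results",
                 "shoes discount", "football match"],
    "class": ["shop", "sport", "shop", "sport"],
})
te_data = pd.DataFrame({
    "text": ["cheap shoes sale", "football scores report"],
    "title": ["shoe shop", "sports news"],
    "headings": ["shoes discount", "football results"],
    "class": ["shop", "sport"],
})

def make_vectorizer():
    # Same settings as the vectorizers in the question, with tf-idf built in.
    return TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True)

features = ColumnTransformer(
    transformers=[
        # A plain string selector passes the column as 1-D text,
        # which is what the text vectorizers expect.
        ("text", make_vectorizer(), "text"),
        ("title", make_vectorizer(), "title"),
        ("headings", make_vectorizer(), "headings"),
    ],
    # Illustrative weights only: each block is multiplied by its weight
    # before the blocks are stacked side by side.
    transformer_weights={"text": 1.0, "title": 2.0, "headings": 1.5},
)

clf = Pipeline([
    ("features", features),
    ("nb", MultinomialNB()),
])

# The "class" column is ignored by the ColumnTransformer (remainder="drop").
clf.fit(tr_data, tr_data["class"])
predictions = clf.predict(te_data)
combined_accuracy = clf.score(te_data, te_data["class"])
print(predictions, combined_accuracy)
```

Because the whole thing is one estimator, it can be handed to `GridSearchCV` or `cross_val_score` as is, which would be one way to compare different weight settings.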
[Discussion]:
Tags: python scikit-learn text-classification supervised-learning multinomial