If your data is text, you need to do feature extraction before applying a classifier. Using an old example from sklearn:
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
X_train = newsgroups_train.data
Y_train = newsgroups_train.target
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)
X_test = newsgroups_test.data
Y_test = newsgroups_test.target
The data looks like this:
Y_train
array([0, 1, 1, ..., 1, 1, 1])
X_train[0][:50]
'From: bil@okcforum.osrhe.edu (Bill Conner)\nSubject'
Apply a vectorizer to convert the text into basic numeric features, then train the model:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
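# Learn the vocabulary and IDF weights from the training text only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary on the test text (transform, not fit_transform)
X_test_vec = vectorizer.transform(X_test)
# The liblinear solver supports the L1 penalty
model = LogisticRegression(solver='liblinear', penalty='l1')
model.fit(X_train_vec, Y_train)
pred = model.predict(X_test_vec)
accuracy_score(Y_test, pred)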
0.906030855539972
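To see what the vectorizer step produces without downloading the newsgroups data, here is a minimal sketch on a tiny made-up corpus. The `docs` and `labels` values are hypothetical stand-ins, not part of the original example, and `make_pipeline` is swapped in as one convenient way to bundle the vectorizer and the classifier so that raw strings can be passed directly to `fit` and `predict`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the newsgroup posts
docs = [
    "the rocket reached orbit",
    "the shuttle launch was delayed",
    "god and religion debate",
    "atheism and belief in god",
]
labels = [1, 1, 0, 0]  # made-up labels: 1 = space, 0 = atheism

# The vectorizer turns each document into one row of a sparse matrix,
# with one column per term in the learned vocabulary
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
n_docs, n_terms = X.shape

# A pipeline applies the same fitted vectorizer at training and prediction time
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="liblinear"))
clf.fit(docs, labels)
pred = clf.predict(["a new rocket launch"])  # raw text in, class label out
```

With a pipeline there is no separate `X_train_vec` / `X_test_vec` bookkeeping, which makes it harder to accidentally refit the vectorizer on test data.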