【Posted】: 2016-04-15 02:20:41
【Question】:
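I am using this as a spam classification model, and I want to use the subject together with the body as features.
All of my features are in a pandas DataFrame: the subject is df['Subject'], the body is df['body_text'], and the spam/ham label is df['ham/spam'].
I get the following error: TypeError: 'FeatureUnion' object is not iterable
How can I pass both df['Subject'] and df['body_text'] through the pipeline as features?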
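import numpy
from sklearn.cross_validation import KFold  # pre-0.18 module, matching the KFold(n=..., n_folds=...) call below
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline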
features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())])
pipeline.fit(combined_2, df['ham/spam'])
k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = combined_2.iloc[train_indices]
    # training labels must come from the training indices
    train_y = df.iloc[train_indices]['ham/spam'].values
    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values
    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)
【Discussion】:
Tags: pandas scikit-learn sklearn-pandas