[Posted]: 2017-04-28 01:51:11
[Problem description]:
I've been experimenting with sklearn's grid search and pipeline functionality, and I noticed that the f1_score it returns does not match the f1_score I get with hard-coded parameters. I'm looking for help understanding why this happens.
Data background: a two-column .csv file:
customer comment (string), category tag (string)
I'm using the out-of-the-box sklearn bag-of-words approach with no text preprocessing, just CountVectorizer.
The hard-coded model...
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

#get .csv data into dataFrame
data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file,header=0,quoting=3)
#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)
#split dataFrame into two series
comment_data = data['comment']
tag_data = data['tag']
#split data into test and train samples
comment_train, comment_test, tag_train, tag_test = train_test_split(
    comment_data, tag_data, test_size=0.33)
#build count vectorizer
vectorizer = CountVectorizer(min_df=.002,analyzer='word',stop_words='english',strip_accents='unicode')
vectorizer.fit(comment_data)
#vectorize features and convert to array
comment_train_features = vectorizer.transform(comment_train).toarray()
comment_test_features = vectorizer.transform(comment_test).toarray()
#train LinearSVC model
lin_svm = LinearSVC()
lin_svm = lin_svm.fit(comment_train_features,tag_train)
#make predictions
lin_svm_predicted_tags = lin_svm.predict(comment_test_features)
#score model
lin_svm_score = round(f1_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_accur = round(accuracy_score(tag_test,lin_svm_predicted_tags),3)
lin_svm_prec = round(precision_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_recall = round(recall_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
#write out scores
print('Model f1Score Accuracy Precision Recall')
print('------ ------- -------- --------- ------')
print('LinSVM {f1:.3f} {ac:.3f} {pr:.3f} {re:.3f} '.format(f1=lin_svm_score,ac=lin_svm_accur,pr=lin_svm_prec,re=lin_svm_recall))
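One detail worth noting in the hard-coded version above: the vectorizer is fitted on comment_data (every row), so vocabulary from the eventual test split leaks into the features, while GridSearchCV refits the whole pipeline per fold and has no such leak. A minimal leak-free sketch, using made-up toy comments and tags in place of the real .csv:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real comment/tag columns (hypothetical data).
comments = ["site is broken again", "page will not load", "love this product",
            "product arrived damaged", "checkout page shows an error",
            "great product overall"]
tags = ["WEB ISSUES", "WEB ISSUES", "PRODUCT RELATED",
        "PRODUCT RELATED", "WEB ISSUES", "PRODUCT RELATED"]

comment_train, comment_test, tag_train, tag_test = train_test_split(
    comments, tags, test_size=0.33, random_state=0)

vectorizer = CountVectorizer(analyzer='word', stop_words='english')
# Fit on the TRAINING comments only; test-set vocabulary stays unseen.
vectorizer.fit(comment_train)
comment_train_features = vectorizer.transform(comment_train).toarray()
comment_test_features = vectorizer.transform(comment_test).toarray()

# Both splits share the feature space built from the training vocabulary.
print(comment_train_features.shape, comment_test_features.shape)
```

Whether this fully explains the gap depends on the data, but it removes one systematic optimistic bias from the hard-coded score.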
The f1_score output is generally around 0.86 (depending on the random seed).
Now, if I rebuild essentially the same thing with grid search and a pipeline...
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#get .csv data into dataFrame
data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file,header=0,quoting=3)
#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)
#build processing pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC()),
])
#define parameters to be used in gridsearch
parameters = {
    #'vect__min_df': (.001,.002,.003,.004,.005),
    'vect__analyzer': ('word',),
    'vect__stop_words': ('english', None),
    'vect__strip_accents': ('unicode',),
    #'clf__C': (1,10,100,1000),
}

if __name__ == '__main__':
    grid_search = GridSearchCV(pipeline,parameters,scoring='f1_macro',n_jobs=1)
    grid_search.fit(data['comment'],data['tag'])
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_params = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_params[param_name]))
The returned f1_score is closer to 0.73, with all model parameters the same. My understanding is that grid search applies cross-validation internally, but my guess is that whatever method it uses differs from the train_test_split in the original code. Still, a drop from 0.86 -> 0.73 feels large to me, and I'd like to be confident in my results.
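For an apples-to-apples check of that guess, the single train_test_split score can be replaced by running the exact same pipeline through cross_val_score with the same f1_macro scorer that GridSearchCV uses; the fold-to-fold spread also shows how far a single random split can move the number. A sketch with hypothetical toy data standing in for the .csv:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real comment/tag columns (hypothetical data).
comments = ["site is broken again", "page will not load", "love this product",
            "product arrived damaged", "checkout page shows an error",
            "great product overall", "login page keeps crashing",
            "product works as described", "error on the payment page"]
tags = ["WEB ISSUES", "WEB ISSUES", "PRODUCT RELATED",
        "PRODUCT RELATED", "WEB ISSUES", "PRODUCT RELATED",
        "WEB ISSUES", "PRODUCT RELATED", "WEB ISSUES"]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC()),
])

# Same f1_macro scorer GridSearchCV was given; for classifiers,
# cross_val_score uses stratified folds under the hood.
scores = cross_val_score(pipeline, comments, tags, scoring='f1_macro', cv=3)
print(scores, scores.mean())
```

If the per-fold scores vary widely, the 0.86 from one random split is simply optimistic, not a sign that GridSearchCV is doing something wrong.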
Any insight would be greatly appreciated.
[Discussion]:
- If you can provide us with a small sample of the data, we can try to reproduce this. As it stands, we would either have to invent data ourselves or make assumptions. Please read How to create a Minimal, Complete, and Verifiable example for guidance.
Tags: python scikit-learn cross-validation