【Posted】: 2016-08-30 12:30:17
【Question】:
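How can I use FeatureUnion in scikit-learn so that GridSearchCV can treat its parts as optional?
The code below works and sets up a FeatureUnion with a TfidfVectorizer for words and a CountVectorizer for characters.
When running the grid search, in addition to testing the defined parameter space, I would also like to test only 'vect__wordvect' with its ngram_range parameter (without the character vectorizer), and also only 'vect__lettervect' with its lowercase parameter set to True and False, with the other vectorizer disabled.
EDIT: Full code example based on maxymoo's suggestion.
How can this be done?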
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import fetch_20newsgroups
# setup the featureunion
wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
featureunionvect = FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])
# setup the pipeline
classifier = LogisticRegression(class_weight='balanced')
pipeline = Pipeline([('vect', featureunionvect), ('classifier', classifier)])
# gridsearch parameters
parameters = {
    'vect__wordvect__ngram_range': [(1, 1), (1, 2)],  # commenting out these two lines
    'vect__lettervect__lowercase': [True, False],     # runs, but there is no parameterization anymore
    'vect__transformer_list': [[('wordvect', wordvect)],
                               [('lettervect', lettervect)],
                               [('wordvect', wordvect), ('lettervect', lettervect)]]}
# data
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])
# gridsearch CV
gs_clf = GridSearchCV(pipeline, parameters)
gs_clf = gs_clf.fit(newsgroups_train.data, newsgroups_train.target)
for score in gs_clf.grid_scores_:
    print "gridsearch scores: ", score
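One possible way to get this behaviour, offered as a sketch under stated assumptions rather than a definitive answer: GridSearchCV accepts a list of parameter dicts and searches each dict on its own, so every FeatureUnion layout can be paired only with the parameters of the transformers it actually contains. The toy corpus (`docs`, `labels`) and `cv=2` are made up so the sketch runs without downloading data, and the import from `sklearn.model_selection` assumes a newer scikit-learn release than the one providing the `sklearn.grid_search` module used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Same building blocks as in the question.
wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
pipeline = Pipeline([
    ('vect', FeatureUnion([('lettervect', lettervect), ('wordvect', wordvect)])),
    ('classifier', LogisticRegression(class_weight='balanced')),
])

ngram_ranges = [(1, 1), (1, 2)]
lowercase_options = [True, False]

# One dict per FeatureUnion layout. Each dict names only the parameters of
# transformers that exist in that layout, so no candidate refers to a
# transformer that has been removed.
parameters = [
    # words only
    {'vect__transformer_list': [[('wordvect', wordvect)]],
     'vect__wordvect__ngram_range': ngram_ranges},
    # characters only
    {'vect__transformer_list': [[('lettervect', lettervect)]],
     'vect__lettervect__lowercase': lowercase_options},
    # both vectorizers
    {'vect__transformer_list': [[('lettervect', lettervect), ('wordvect', wordvect)]],
     'vect__wordvect__ngram_range': ngram_ranges,
     'vect__lettervect__lowercase': lowercase_options},
]

# Hypothetical toy corpus standing in for the 20 newsgroups data.
docs = [
    "God does not exist say the atheists",
    "Atheism rejects belief in any god",
    "Religion and belief are debated by atheists",
    "The atheist argued about faith and god",
    "The rocket launched into orbit around Earth",
    "NASA plans a new space shuttle mission",
    "The satellite reached orbit after launch",
    "Astronauts aboard the space station orbit Earth",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

gs_clf = GridSearchCV(pipeline, parameters, cv=2)
gs_clf.fit(docs, labels)

# The three dicts contribute 2 + 2 + 4 = 8 candidates in total.
print(len(gs_clf.cv_results_['params']))
print(gs_clf.best_params_)
```

Each entry of `gs_clf.cv_results_['params']` records which layout and which vectorizer settings a candidate used. Newer scikit-learn releases also accept the string 'drop' in place of a transformer inside a FeatureUnion, which may be an alternative way to disable one branch.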
【Comments】:
Tags: python scikit-learn