【Posted】: 2021-05-08 15:41:52
【Question】:
I am running a hyperparameter search with scikit-learn's GridSearchCV over a pipeline made of a CountVectorizer and a RandomForestClassifier. The search grid looks like this:
grid = {
    'vectorizer__ngram_range': [(1, 1)],
    'vectorizer__stop_words': [None, german_stop_words],
    'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
    'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],
    'vectorizer__max_features': [None, 100, 1000, 1500],
    'classifier__class_weight': ['balanced', 'balanced_subsample', None],
    'classifier__n_jobs': [-1],
    'classifier__n_estimators': [100, 190, 250]
}
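The grid search runs to completion and gives me a best_params_ result. I have run it several times and got different results each time. During the runs I sometimes see these errors: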
warnings.warn("Estimator fit failed. The score on this train-test"
/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:548: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 330, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 292, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/root/complex_semantics/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 740, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1213, in fit_transform
    raise ValueError(
ValueError: max_df corresponds to < documents than min_df
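I assumed this was normal, since some of the values in the grid do not combine well. But after getting the best parameters and running the model with them a few times, I get an error telling me that the max_df and min_df values are invalid, because max_df corresponds to fewer documents than min_df.
Why does it run correctly during the hyperparameter search, but not in a normal run on the same dataset?
Any ideas? Is there a way to avoid this?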
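To make the error message concrete, here is a minimal sketch of the check that, as far as I understand it, CountVectorizer applies before pruning the vocabulary: an integer min_df/max_df is treated as an absolute number of documents, a float as a proportion of the documents passed to fit. The helper names doc_count_limits and is_valid are hypothetical and not part of scikit-learn; the sketch uses only the standard library.

```python
import numbers

def doc_count_limits(min_df, max_df, n_docs):
    """Hypothetical helper: convert min_df/max_df into document counts.

    Integers are taken as absolute document counts, floats as a
    proportion of n_docs (the number of documents passed to fit).
    """
    max_count = max_df if isinstance(max_df, numbers.Integral) else max_df * n_docs
    min_count = min_df if isinstance(min_df, numbers.Integral) else min_df * n_docs
    return min_count, max_count

def is_valid(min_df, max_df, n_docs):
    """True when the pair would not trigger the 'max_df corresponds to
    < documents than min_df' error for a corpus of n_docs documents."""
    min_count, max_count = doc_count_limits(min_df, max_df, n_docs)
    return max_count >= min_count

# The integer 1 in the max_df list means "at most 1 document", not 100 %.
print(is_valid(min_df=5, max_df=1, n_docs=1000))     # False
print(is_valid(min_df=5, max_df=1.0, n_docs=1000))   # True

# A mixed int/float pair can be valid or invalid depending on corpus size.
print(is_valid(min_df=10, max_df=0.25, n_docs=100))  # True  (25 >= 10)
print(is_valid(min_df=10, max_df=0.25, n_docs=30))   # False (7.5 < 10)
```

If this reading is right, two things in the grid could matter: the integer 1 in the max_df list is an absolute count rather than a proportion, and a pair that mixes an integer with a float can pass or fail depending on how many documents reach fit. Whether either applies to the runs described here has not been verified against the actual data.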
Here is the GridSearch code:
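from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Step names must match the 'vectorizer__' / 'classifier__' prefixes used in the grid
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])
scoring_function = make_scorer(matthews_corrcoef)
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring=scoring_function, n_jobs=-1, cv=5)
grid_search.fit(X=train_text, y=train_labels)
print("-----------")
print(grid_search.best_score_)
print(grid_search.best_params_)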
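One way to keep incompatible min_df/max_df pairs out of the search might be to pass param_grid as a list of dicts, one per valid pair, since GridSearchCV accepts a list of grids. The sketch below is only an illustration under stated assumptions: valid_df_grids and to_count are hypothetical helpers, n_total is a made-up corpus size, the smallest training fold for cv=5 is approximated, base_grid is shortened, and 1.0 is assumed in place of the integer 1 in the max_df list.

```python
import math
import numbers

def to_count(value, n_docs):
    """Integers are absolute document counts, floats a proportion of n_docs."""
    return value if isinstance(value, numbers.Integral) else value * n_docs

def valid_df_grids(base_grid, min_dfs, max_dfs, doc_counts):
    """Hypothetical helper: build one sub-grid per (min_df, max_df) pair
    that is valid for every corpus size listed in doc_counts."""
    grids = []
    for min_df in min_dfs:
        for max_df in max_dfs:
            if all(to_count(max_df, n) >= to_count(min_df, n) for n in doc_counts):
                grids.append({**base_grid,
                              'vectorizer__min_df': [min_df],
                              'vectorizer__max_df': [max_df]})
    return grids

n_total = 30                               # made-up corpus size, for illustration only
n_fold = n_total - math.ceil(n_total / 5)  # approx. smallest training fold for cv=5

base_grid = {                              # shortened version of the original grid
    'vectorizer__ngram_range': [(1, 1)],
    'classifier__n_estimators': [100, 190, 250],
}
grids = valid_df_grids(base_grid,
                       min_dfs=[0.01, 0.1, 1, 5, 10],
                       max_dfs=[0.25, 0.5, 0.75, 1.0],  # 1.0 assumed instead of the integer 1
                       doc_counts=[n_fold, n_total])
print(len(grids))  # 19 of the 20 pairs remain; (min_df=10, max_df=0.25) is dropped

# The list can then be passed as param_grid:
# grid_search = GridSearchCV(pipeline, param_grid=grids, scoring=scoring_function, n_jobs=-1, cv=5)
```

Checking both the training-fold size and the full corpus size is a deliberate choice here: a float max_df with an integer min_df gets stricter as the corpus shrinks, while an integer max_df with a float min_df gets stricter as it grows. Passing error_score='raise' to GridSearchCV is another option for surfacing a failing combination immediately instead of recording nan.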
【Comments】:
Tags: python-3.x scikit-learn grid-search