Hyperparameter search with Gridsearch giving parameter values that don't work

【Question title】: Hyperparameter search with Gridsearch giving parameter values that don't work
【Posted】: 2021-05-08 15:41:52
【Question】:

I am running a hyperparameter search with scikit-learn's GridSearch over a CountVectorizer and a RandomForestClassifier. The search grid looks like this:

grid = {
    'vectorizer__ngram_range': [(1, 1)],
    'vectorizer__stop_words': [None, german_stop_words],
    'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
    'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],
    'vectorizer__max_features': [None, 100, 1000, 1500],
    'classifier__class_weight': ['balanced', 'balanced_subsample', None],
    'classifier__n_jobs': [-1],
    'classifier__n_estimators': [100, 190, 250]
}

The grid search runs to completion and gives me a best_params result. I have run it several times and got different results. During the runs I sometimes get these errors:

  warnings.warn("Estimator fit failed. The score on this train-test"
/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:548: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 330, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 292, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/root/complex_semantics/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 740, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1213, in fit_transform
    raise ValueError(
ValueError: max_df corresponds to < documents than min_df

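I assumed this was normal, since some of the values do not combine well. But after getting the best parameters and running the model with them several times, I get an error telling me that the max_df and min_df values are incorrect, because max_df selects fewer documents than min_df.

Why does it run correctly during the hyperparameter search but fail in a normal run on the same dataset?

Any ideas? Is there a way to avoid this?

Here is the code for the GridSearch: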

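from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])

scoring_function = make_scorer(matthews_corrcoef)
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring=scoring_function, n_jobs=-1, cv=5)
grid_search.fit(X=train_text, y=train_labels)
print("-----------")
print(grid_search.best_score_)
print(grid_search.best_params_)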

【Comments】:

Tags: python-3.x scikit-learn grid-search

【Solution 1】:

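Some of the max_df values in your grid correspond to fewer documents than some of the min_df values.

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents", so by default no term is dropped for being too frequent.

min_df is used to remove terms that appear too infrequently.

Let's look at what you have.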

'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],

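Let's take an example.

max_df = 0.25 means "ignore terms that appear in more than 25% of the documents".
min_df = 0.01 means "ignore terms that appear in fewer than 1% of the documents".

The problem I see is the 5 and 10 in min_df.

min_df = 5 means "ignore terms that appear in fewer than 5 documents".
min_df = 10 means "ignore terms that appear in fewer than 10 documents".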
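
To make the float-versus-integer distinction concrete, here is a minimal sketch of how the two settings turn into document counts. The helper name doc_count_thresholds and the corpus size of 1000 are assumptions for illustration; the helper only mirrors the rule described above (floats are proportions, integers are absolute counts) and is not scikit-learn's own code.

```python
import numbers


def doc_count_thresholds(max_df, min_df, n_docs):
    """Convert max_df / min_df into document counts.

    Integers are taken as absolute document counts; floats are taken
    as proportions of n_docs. (Hypothetical helper for illustration.)
    """
    max_count = max_df if isinstance(max_df, numbers.Integral) else max_df * n_docs
    min_count = min_df if isinstance(min_df, numbers.Integral) else min_df * n_docs
    return max_count, min_count


n_docs = 1000  # assumed corpus size, not taken from the question

# Float 1.0 means "all documents"; float 0.01 means 1% of them.
print(doc_count_thresholds(1.0, 0.01, n_docs))

# Integer 1 means "one document"; integer 5 means "five documents".
print(doc_count_thresholds(1, 5, n_docs))
```

The second call shows why 1 and 1.0 are not interchangeable: as an integer, 1 is a count of a single document.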

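The error even tells you this: ValueError: max_df corresponds to < documents than min_df. It probably comes from using 10 or 5 in min_df, since the number of documents allowed by max_df may be smaller than these values. Note also that the integer 1 in your max_df list is read as an absolute count of one document, not as the proportion 1.0.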
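
As a rough check, the (max_df, min_df) pairs from the question's grid can be enumerated to see which ones would trigger this error. This is only a sketch: is_valid is a hypothetical helper that applies the same "proportion or absolute count" rule, and the corpus sizes (1000, 80, 101) are assumed values, not numbers from the question.

```python
import numbers
from itertools import product


def is_valid(max_df, min_df, n_docs):
    """Return True if max_df allows at least as many documents as min_df.

    Integers are absolute counts, floats are proportions of n_docs.
    (Hypothetical helper for illustration.)
    """
    max_count = max_df if isinstance(max_df, numbers.Integral) else max_df * n_docs
    min_count = min_df if isinstance(min_df, numbers.Integral) else min_df * n_docs
    return max_count >= min_count


max_dfs = [0.25, 0.5, 0.75, 1]     # values from the question's grid
min_dfs = [0.01, 0.1, 1, 5, 10]    # values from the question's grid
n_docs = 1000                      # assumed corpus size

invalid = [
    (mx, mn)
    for mx, mn in product(max_dfs, min_dfs)
    if not is_valid(mx, mn, n_docs)
]
print(invalid)

# The same pair can pass on a smaller training fold and fail on more data,
# because a float threshold scales with the number of documents.
print(is_valid(1, 0.01, 80))    # 0.8 documents required, 1 allowed
print(is_valid(1, 0.01, 101))   # 1.01 documents required, 1 allowed
```

Under these assumed sizes every invalid pair involves the integer max_df = 1. The last two lines hint at one possible reason a combination can fit during cross-validation, where each training fold is smaller than the full set, and still fail later on a larger set of documents.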

So I suggest sticking to float values (proportions) for both max_df and min_df, perhaps using the values [0.01, 0.1, 0.2] for vectorizer__min_df.
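
One way to apply this suggestion is to build the grid from proportions only and keep just the pairs where max_df is at least min_df. The sketch below is an illustration under assumptions: valid_float_grids and base_grid are hypothetical names, and the max_df list simply reuses the question's values with 1 written as 1.0.

```python
from itertools import product


def valid_float_grids(base_grid, max_dfs, min_dfs):
    """Return a list of sub-grids, one per (max_df, min_df) pair.

    Only pairs with max_df >= min_df are kept. Both lists are expected
    to hold floats (proportions), so the check does not depend on the
    number of documents. (Hypothetical helper for illustration.)
    """
    grids = []
    for mx, mn in product(max_dfs, min_dfs):
        if mx >= mn:
            sub_grid = dict(base_grid)
            sub_grid["vectorizer__max_df"] = [mx]
            sub_grid["vectorizer__min_df"] = [mn]
            grids.append(sub_grid)
    return grids


# Remaining parameters from the question, shortened here for brevity.
base_grid = {"classifier__n_estimators": [100, 190, 250]}

param_grid = valid_float_grids(
    base_grid,
    max_dfs=[0.25, 0.5, 0.75, 1.0],  # 1.0 (float), not 1 (int)
    min_dfs=[0.01, 0.1, 0.2],
)
print(len(param_grid))
```

GridSearchCV accepts a list of dictionaries as param_grid, so the result can be passed in place of the original grid. Setting error_score='raise' on GridSearchCV makes a failed fit raise immediately instead of being scored as nan, which can help surface bad combinations early.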

【Comments】: