Why does GridSearchCV behave differently with the same code?

【Question title】: Why does GridSearchCV behave differently with the same code?
【Posted】: 2019-08-14 12:57:53
【Question】:

I am trying to call GridSearchCV to get the best estimator. If I pass the parameters like this:

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# X_train, X_test, y_train, y_test are assumed to be defined already
clf = DecisionTreeClassifier(random_state=42)

parameters = {'max_depth': [2,3,4,5,6,7,8,9,10],
              'min_samples_leaf': [2,3,4,5,6,7,8,9,10],
              'min_samples_split': [2,3,4,5,6,7,8,9,10]}

scorer = make_scorer(f1_score)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

grid_fit = grid_obj.fit(X_train, y_train)

best_clf = grid_fit.best_estimator_

best_clf.fit(X_train, y_train)

best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# f1_score expects (y_true, y_pred)
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))

The result is

The training F1 Score is 0.784810126582
The testing F1 Score is 0.72

On the same data, the result changes when all I do is replace [2,3,4,5,6,7,8,9,10] with [2,4,6,8,10]:

clf = DecisionTreeClassifier(random_state=42)

parameters = {'max_depth': [2,4,6,8,10],
              'min_samples_leaf': [2,4,6,8,10],
              'min_samples_split': [2,4,6,8,10]}

scorer = make_scorer(f1_score)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train)
best_clf = grid_fit.best_estimator_
best_clf.fit(X_train, y_train)
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# f1_score expects (y_true, y_pred)
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))

The result is

The training F1 Score is 0.814814814815
The testing F1 Score is 0.8

I am confused about how GridSearchCV works.

【Comments】:

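Randomness can creep in at several places. For example, your train/test split may give different results from one run to the next.
Are you confused about why the scores differ between the two cases, or rather about why the first is lower than the second?
Why the scores differ.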
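The first comment's point about the split can be checked with a small sketch. The question does not show how X_train and X_test were produced, so the use of train_test_split and the toy arrays X_toy and y_toy here are assumptions made only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative (not the asker's data).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# With the same random_state, every call returns the identical split.
a_train, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
b_train, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
print(np.array_equal(a_train, b_train))  # True

# Without random_state, the shuffle uses NumPy's global random state,
# so the split (and every score computed from it) can change between runs.
```

If the split is not seeded, two runs of otherwise identical code are trained and scored on different rows, which alone is enough to move the F1 values.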

Tags: python machine-learning scikit-learn gridsearchcv

【Solution 1】:

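By changing the values the grid search examines, you evaluate and compare your model over a different set of hyperparameter combinations. Remember that what GridSearchCV ultimately does is pick the best combination from the candidates it is given.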

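So in your code grid_fit.best_estimator_ may be a different model in each run, which naturally explains why the two runs produce different scores on the training and test sets.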

For example, you might end up with this in the first case

clf = DecisionTreeClassifier(max_depth = 3, min_samples_leaf = 5, min_samples_split = 9)

and this in the second case

clf = DecisionTreeClassifier(max_depth = 2, min_samples_leaf = 4, min_samples_split = 8)

(To check, look at grid_fit.best_params_ in each case.)
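A minimal sketch of that check follows. The asker's data is not shown, so the synthetic dataset from make_classification, the helper name run_search, and the explicit cv=5 are all assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)


def run_search(values):
    """Grid-search the three hyperparameters from the question over `values`."""
    grid = {
        'max_depth': values,
        'min_samples_leaf': values,
        'min_samples_split': values,
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=42), grid,
                          scoring=make_scorer(f1_score), cv=5)
    return search.fit(X_train, y_train)


full = run_search([2, 3, 4, 5, 6, 7, 8, 9, 10])
sub = run_search([2, 4, 6, 8, 10])

# Compare which combination each search selected and its mean CV score.
for name, search in [('full grid', full), ('subset grid', sub)]:
    print(name, search.best_params_, 'mean CV F1 = %.3f' % search.best_score_)
```

If the two printed parameter sets differ, the two best_estimator_ objects are different trees, and different train and test scores are expected.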

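However, you would expect the first case to score at least as well, because your second grid search uses a subset of the first one's parameters. Strictly speaking, that guarantee covers only the mean cross-validation score (grid_fit.best_score_) when both searches use the same folds; the F1 scores measured afterwards on the training and test sets can go either way. Beyond that, as @Attack68 mentioned above, the gap is most likely due to randomness at steps you are not controlling.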
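The subset argument can be sketched as follows, with the folds pinned so that both searches see exactly the same splits. The synthetic data (X_demo, y_demo), the helper best_cv_score, and the choice of a seeded StratifiedKFold are assumptions for illustration, not part of the original question:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's training data (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Seeded folds: both searches are evaluated on identical splits.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scorer = make_scorer(f1_score)


def best_cv_score(values):
    """Return the best mean cross-validated F1 over a cubic grid of `values`."""
    grid = {
        'max_depth': values,
        'min_samples_leaf': values,
        'min_samples_split': values,
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=42), grid,
                          scoring=scorer, cv=folds)
    search.fit(X_demo, y_demo)
    return search.best_score_


full_score = best_cv_score([2, 3, 4, 5, 6, 7, 8, 9, 10])
sub_score = best_cv_score([2, 4, 6, 8, 10])

# Every candidate in the subset grid is also in the full grid, so the
# full grid's best CV score cannot be lower when the folds are identical.
print('full grid best CV F1  :', full_score)
print('subset grid best CV F1:', sub_score)
```

With the tree and the folds both seeded, every candidate shared by the two grids gets the same cross-validation score in both searches, so the larger grid can only match or beat the smaller one on best_score_. Nothing comparable holds for the train and test F1 reported in the question.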
