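【Posted】: 2020-10-10 23:27:20
【Question】:
Using the UCI Human Activity Recognition dataset, I am trying to build a decision tree classifier. With default parameters and random_state set to 156, the model returns the following accuracy: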
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
print('DecisionTree Accuracy Score: {0:.4f}'.format(accuracy_score(y_test, pred)))
Output:
DecisionTree Accuracy Score: 0.8548
Using an arbitrarily chosen set of max_depth values, I ran GridSearchCV to find the best one:
from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [6, 8, 10, 12, 16, 20, 24]
}
grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1)
grid_cv.fit(X_train, y_train)
print('GridSearchCV Best Score: {0:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV Best Params:', grid_cv.best_params_)
Output:
Fitting 5 folds for each of 7 candidates, totalling 35 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 35 out of 35 | elapsed: 1.6min finished
GridSearchCV Best Score: 0.8513
GridSearchCV Best Params: {'max_depth': 16}
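For context, best_score_ and best_params_ only summarize the winning candidate. The per-candidate mean and standard deviation of the fold scores are stored in cv_results_. Here is a minimal sketch of reading them, assuming synthetic stand-in data from make_classification rather than the UCI data. The names X_demo, y_demo, demo_params and demo_cv are hypothetical and exist only for this illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is NOT the UCI HAR dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Hypothetical small grid, just to keep the sketch fast.
demo_params = {'max_depth': [2, 4, 8, 16]}
demo_cv = GridSearchCV(
    DecisionTreeClassifier(random_state=156),
    param_grid=demo_params, scoring='accuracy', cv=5,
)
demo_cv.fit(X_demo, y_demo)

# One entry per candidate: mean and spread of the 5 per-fold accuracies.
results = demo_cv.cv_results_
for cand, mean_acc, std_acc in zip(
    results['params'], results['mean_test_score'], results['std_test_score']
):
    print('{0} mean={1:.4f} std={2:.4f}'.format(cand, mean_acc, std_acc))
```

If the standard deviations are of the same order as the gaps between the means, the ranking of the candidates is probably not very stable.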
Now I wanted to test the "best parameter" max_depth=16 on the separate test set, to see whether it really is the best value in the supplied list max_depth = [6, 8, 10, 12, 16, 20, 24].
max_depths = [6, 8, 10, 12, 16, 20, 24]
for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=156)
    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print('max_depth = {0} Accuracy: {1:.4f}'.format(depth, accuracy))
But to my surprise, the test showed that the "best parameter" max_depth=16 was far from the best of the bunch:
Output:
max_depth = 6 Accuracy: 0.8558
max_depth = 8 Accuracy: 0.8707
max_depth = 10 Accuracy: 0.8673
max_depth = 12 Accuracy: 0.8646
max_depth = 16 Accuracy: 0.8575
max_depth = 20 Accuracy: 0.8548
max_depth = 24 Accuracy: 0.8548
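I know that GridSearchCV picks its best parameter from the mean test score obtained by cross-validating the training set (X_train, y_train), but shouldn't that still be reflected on the test set to some degree? I assume the UCI dataset is not imbalanced, so dataset bias should not be an issue.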
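To make the comparison concrete, here is a minimal sketch that puts the cross-validated mean accuracy next to the held-out accuracy for each depth. It assumes synthetic stand-in data and a random stratified split, neither of which comes from the UCI dataset. The names X_all, y_all, X_tr, X_te, y_tr, y_te and comparison are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is NOT the UCI HAR dataset).
X_all, y_all = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

comparison = []
for depth in [2, 4, 8, 16]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=156)
    # Mean accuracy over 5 folds of the training set only.
    cv_mean = cross_val_score(clf, X_tr, y_tr, cv=5, scoring='accuracy').mean()
    # Accuracy on the held-out set after refitting on the full training set.
    test_acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    comparison.append((depth, cv_mean, test_acc))
    print('max_depth = {0} CV mean: {1:.4f} Test: {2:.4f}'.format(depth, cv_mean, test_acc))
```

Running the same loop with my real X_train, y_train, X_test and y_test would show, depth by depth, how far the cross-validation ranking and the test-set ranking diverge.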
【Discussion】:
Tags: python machine-learning scikit-learn decision-tree grid-search