【Posted】: 2018-09-21 11:44:20
【Problem description】:
This may be an odd question, since I don't yet fully understand hyperparameter tuning.
Currently I am using sklearn's GridSearchCV to tune the parameters of a RandomForestClassifier, like this:
gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
results = gs.cv_results_
Afterwards I check best_params_ and best_score_ on the gs object. Then I instantiate a RandomForestClassifier with best_params_ and use stratified cross-validation again to record the metrics and print the confusion matrix:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             precision_recall_fscore_support as score)
from sklearn.model_selection import StratifiedKFold

rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=7, max_depth=18,
                            criterion='entropy', random_state=42)
skf = StratifiedKFold(n_splits=3)  # skf was not shown in the question; a stratified 3-fold split is assumed
metrics = {'accuracy': [], 'precision': [], 'recall': [], 'fscore': [], 'support': []}
counter = 0
print('################################################### RandomForest ###################################################')
for train_index, test_index in skf.split(X_Distances, Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(round(accuracy_score(y_test, y_pred), 2))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)
    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, 'confusion_matrix_randomforest_distances_' + str(counter) + '.png')
    counter = counter + 1
meanAcc = round(np.mean(np.asarray(metrics['accuracy'])), 2) * 100
print('meanAcc: ', meanAcc)
Is this a reasonable approach, or am I going about it completely wrong?
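As an aside, the manual per-fold loop above can also be expressed with sklearn's cross_validate, which fits a fresh clone of the estimator per fold and collects all scores. A minimal runnable sketch on synthetic data (the synthetic X/y stand in for the original X_Distances/Y, which are not shown in the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for X_Distances / Y, just to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=7, max_depth=18,
                            criterion='entropy', random_state=42)
skf = StratifiedKFold(n_splits=3)

# cross_validate fits a fresh clone of rf on each training fold and
# evaluates every requested metric on the corresponding test fold
cv_results = cross_validate(rf, X, y, cv=skf,
                            scoring=['accuracy', 'precision_macro', 'recall_macro'])
mean_acc = np.mean(cv_results['test_accuracy'])
print('meanAcc:', round(mean_acc, 2) * 100)
```

This removes the bookkeeping around the metrics dict, though the per-fold classification report and confusion matrix would still need the explicit loop.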
Edit:
I just tested the following:
gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
This yields best_score = 0.5362903225806451 at best_index = 28. When I look at the accuracies of the 3 folds at index 28, I get:
- split0: 0.5185929648241207
- split1: 0.526686807653575
- split2: 0.5637651821862348
This gives a mean test accuracy of 0.5362903225806451. best_params_: {'criterion': 'entropy', 'max_depth': 21, 'min_samples_leaf': 5}
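For reference, these per-split and mean values can be read directly out of cv_results_ at best_index_. A minimal runnable sketch on synthetic data with a much smaller grid than the original (with multi-metric scoring and refit='Accuracy', the result keys end in '_Accuracy'):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, just to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid={'max_depth': [5, 9], 'criterion': ['entropy', 'gini']},
    scoring={'Accuracy': 'accuracy'},  # dict form, so keys end in '_Accuracy'
    cv=3, refit='Accuracy')
gs.fit(X, y)

i = gs.best_index_
# Per-split scores and the reported mean for the winning candidate
splits = [gs.cv_results_['split%d_test_Accuracy' % k][i] for k in range(3)]
print('per-split accuracies:', splits)
print('mean_test_Accuracy:', gs.cv_results_['mean_test_Accuracy'][i])
print('best_params_:', gs.best_params_)
```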
Now I run this code, which uses the best_params above with a stratified 3-fold split (as in GridSearchCV):
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, max_depth=21,
                            criterion='entropy', random_state=42)
metrics = {'accuracy': [], 'precision': [], 'recall': [], 'fscore': [], 'support': []}
counter = 0
print('################################################### RandomForest_Gini ###################################################')
for train_index, test_index in skf.split(X_Distances, Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    precision, recall, fscore, support = np.round(score(y_test, y_pred))
    metrics['accuracy'].append(accuracy_score(y_test, y_pred))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)
    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, 'confusion_matrix_randomforest_distances_' + str(counter) + '.png')
    counter = counter + 1
meanAcc = np.mean(np.asarray(metrics['accuracy']))
print('meanAcc: ', meanAcc)
The metrics dict yields exactly the same per-fold accuracies (split0: 0.5185929648241207, split1: 0.526686807653575, split2: 0.5637651821862348),
but the computed mean is slightly off: 0.5363483182213101
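A likely explanation for the small gap: in scikit-learn versions before 0.22, GridSearchCV had an iid parameter (defaulting to True) that made mean_test_score a test-sample-weighted average of the split scores, whereas np.mean above is an unweighted average. With stratified folds of slightly unequal sizes, the two differ slightly. A quick check with the three split accuracies from the post (the fold sizes below are purely hypothetical, just to illustrate the weighting effect):

```python
import numpy as np

splits = np.array([0.5185929648241207, 0.526686807653575, 0.5637651821862348])

# Unweighted mean of the three folds - reproduces the manually computed
# value of roughly 0.53634831822131
print(np.mean(splits))

# Hypothetical unequal fold sizes (NOT the real ones, which the post does
# not give) - a test-size-weighted mean, as with the old iid=True behaviour,
# shifts the result slightly away from the unweighted mean
sizes = np.array([670, 663, 657])
print(np.average(splits, weights=sizes))
```

In scikit-learn 0.22+ the iid weighting was removed and mean_test_score is the plain unweighted mean, so the two numbers would agree there.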
【Discussion】:
Tags: python machine-learning hyperparameters