[Posted]: 2018-08-14 12:25:32
[Question]:
I just built my first random forest classifier today and I'm trying to improve its performance. I've been reading about how important cross-validation is for avoiding overfitting and thus getting better results. I implemented StratifiedKFold with sklearn; surprisingly, however, this approach turned out to be less accurate. I've read many posts suggesting that cross-validating is much more effective than train_test_split.
Estimator:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
K-fold:
ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]
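As a minimal sketch of the same cross-validation setup: `cross_val_score` runs the fit/score loop over the folds internally and returns one accuracy score per fold, whose mean is the cross-validated estimate. (The synthetic data from `make_classification` is an assumption here, used only to keep the snippet self-contained; in the original question `features` and `labels` come from the poster's own dataset.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real features/labels (assumption, for illustration)
features, labels = make_classification(n_samples=300, n_features=10, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# One accuracy score per fold; the mean is the cross-validated estimate
scores = cross_val_score(rf, features, labels, cv=ss)
print(scores.mean(), scores.std())
```

Note that the point of this pattern is estimation, not model selection: the per-fold scores estimate how the model generalizes, while a single train/test split gives only one (noisier) such estimate.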
TTS:
train_feature, test_feature, train_label, test_label = \
    train_test_split(features, labels, train_size=0.8, test_size=0.2, random_state=42)
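A sketch of how the metrics reported below could be computed on the held-out split, using the metric functions from `sklearn.metrics`. (The synthetic data is an assumption, used only to make the snippet runnable; the question does not show the poster's actual metric code.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption, for illustration)
features, labels = make_classification(n_samples=300, random_state=42)

train_feature, test_feature, train_label, test_label = train_test_split(
    features, labels, train_size=0.8, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(train_feature, train_label)

pred = rf.predict(test_feature)
proba = rf.predict_proba(test_feature)[:, 1]  # probability of the positive class

print("AUROC:", roc_auc_score(test_label, proba))
print("Accuracy:", accuracy_score(test_label, pred))
print("MCC:", matthews_corrcoef(test_label, pred))
print("F1:", f1_score(test_label, pred))
```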
Here are the results:
CV:
AUROC: 0.74
Accuracy Score: 74.74 %.
Specificity: 0.69
Precision: 0.75
Sensitivity: 0.79
Matthews correlation coefficient (MCC): 0.49
F1 Score: 0.77
TTS:
AUROC: 0.76
Accuracy Score: 76.23 %.
Specificity: 0.77
Precision: 0.79
Sensitivity: 0.76
Matthews correlation coefficient (MCC): 0.52
F1 Score: 0.77
Is this really possible, or did I set up my model incorrectly?
Also, is this the correct way to use cross-validation?
[Discussion]:
标签: python machine-learning scikit-learn cross-validation