Strange behavior of sklearn random forest classifier (answers)

【Question title】: Strange behavior of sklearn random forest classifier
【Posted】: 2021-08-13 14:25:16
【Question】:

I am trying to use the sklearn random forest classifier (in Python), and I am getting some strange results.

My code is:

    rf = tree(data_handler.train_dataset, data_handler.train_labels, num_of_estimemtors, tree_depth, 42, tree_max_featrues)        
    # evaluate(rf, data_handler.train_dataset, data_handler.train_labels)
    evaluate(rf, data_handler.test_dataset, data_handler.test_labels)

(The implementations of tree and evaluate are shown below.)

When the second line is commented out, the results are poor:

 0.4772727272727273
[[ 0 23]
 [ 0 21]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        23
           1       0.48      1.00      0.65        21

    accuracy                           0.48        44
   macro avg       0.24      0.50      0.32        44
weighted avg       0.23      0.48      0.31        44

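However, when this line is uncommented, the results change dramatically. The first report is for the training set and the second is for the test set: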

0.9846153846153847
[[1235    0]
 [  38 1197]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1235
           1       1.00      0.97      0.98      1235

    accuracy                           0.98      2470
   macro avg       0.99      0.98      0.98      2470
weighted avg       0.99      0.98      0.98      2470

0.5909090909090909
[[ 8 15]
 [ 3 18]]
              precision    recall  f1-score   support

           0       0.73      0.35      0.47        23
           1       0.55      0.86      0.67        21

    accuracy                           0.59        44
   macro avg       0.64      0.60      0.57        44
weighted avg       0.64      0.59      0.56        44

This function does not change rf (the random forest). I have spent half a day trying to understand this and failed. What is wrong here?

Function implementations:

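    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    def evaluate(rf, x, y):
        # Predict once and reuse the predictions for every metric.
        pred = rf.predict(x)
        acc = accuracy_score(y, pred)
        print(acc)
        print(confusion_matrix(y, pred))
        print(classification_report(y, pred))
        return acc

    def tree(x, y, est, depth, seed=42, max_features="auto"):
        # Fit a random forest with a fixed seed so that runs are repeatable.
        # Note: max_features="auto" was valid when this was posted; it was later
        # removed from scikit-learn, where "sqrt" is the equivalent for classifiers.
        rf = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=seed, bootstrap=True, max_features=max_features)
        rf.fit(x, y)
        return rf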

【Question comments】:

Bootstrapping introduces some randomness. Is this result consistent across multiple runs?
Yes, it is consistent.
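
The consistency discussed in these comments can be checked directly. The following is a minimal sketch, not the asker's code: it uses synthetic data from make_classification, and the helper fit_forest, the dataset sizes, and the hyperparameter values are assumptions chosen for illustration. It fits the same configuration twice with a fixed random_state and compares test-set predictions taken before and after a call to predict on the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the asker's data is not available).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def fit_forest(seed):
    # Hypothetical helper that mirrors the question's tree() with max_depth=2.
    rf = RandomForestClassifier(
        n_estimators=50, max_depth=2, random_state=seed, bootstrap=True
    )
    return rf.fit(X_train, y_train)


rf_a = fit_forest(42)
rf_b = fit_forest(42)

# Predict on the test set, then on the training set, then on the test set again.
before = rf_a.predict(X_test)
rf_a.predict(X_train)
after = rf_a.predict(X_test)

# Check 1: calling predict() on other data leaves the test predictions unchanged.
same_after_train_predict = np.array_equal(before, after)
# Check 2: two fits with the same seed and the same data give the same predictions.
same_across_fits = np.array_equal(before, rf_b.predict(X_test))

print(same_after_train_predict, same_across_fits)
```

Running the same two checks against the real data_handler arrays would show whether the forest itself behaves differently between the two runs, or whether the difference enters elsewhere.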

Tags: python machine-learning scikit-learn random-forest

【Solution 1】:

If I understand correctly, you see a large drop in the model's performance metrics when you evaluate it with test_dataset and test_labels.

"This function does not change rf (the random forest). I have spent half a day trying to understand this and failed. What is wrong here?"

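You did not show the values passed to tree, but your model is probably overfitting. I suggest retraining rf with a lower max_depth, more n_estimators, or fewer max_features. You can also run a grid search to find the best combination of hyperparameters.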
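
A grid search over the three hyperparameters named above could look like the sketch below. It uses scikit-learn's GridSearchCV on synthetic data; the dataset, the grid values, and the 3-fold setting are assumptions for illustration and would need to be adapted to the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: replace with the real training and test sets).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative grid; these values are assumptions, not recommendations.
param_grid = {
    "max_depth": [2, 5, None],
    "n_estimators": [25, 50],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)            # best combination found on the grid
print(search.best_score_)             # mean cross-validated accuracy for it
print(search.score(X_test, y_test))   # accuracy of the refit model on held-out data
```

Because the search is cross-validated on the training data, the test set is only touched once at the end, which keeps the final score an honest estimate of generalization.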

Again, I do not know what your current hyperparameters are, but in my personal experience a max_depth of about 5 usually gives a model that generalizes well.

【Comments】:

Actually, my max_depth is 2. I know the model is overfitting, but that is not my question. My question is about the unexplained change in performance.