Specific Cross Validation with Random Forest: Answers

【Question title】: Specific Cross Validation with Random Forest
【Posted】: 2020-02-28 09:49:45
【Question description】:

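I am using Random Forest with scikit-learn. The RF overfits the data and the prediction results are bad.

The overfitting does not depend on the RF parameters: NBtree, Depth_Tree

The overfitting happens with many different parameter settings (tested with grid_search).

Remedy: I tweak the initial data / down-sample some results in order to influence the fit (manually pre-processing noisy samples).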

Loop over randomly generated RF fits:

Get the RF prediction on the data held out for prediction.
Select the model which best fits the "predicted data" (not the calibration data).
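
The loop described above might look roughly like the following. This is only a minimal sketch: the question does not show its data or code, so the synthetic data from make_classification, the names X_calib / X_pred, the 10 seeds and n_estimators=20 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the question's data is not shown).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_calib, X_pred, y_calib, y_pred_true = train_test_split(
    X, y, test_size=0.3, random_state=0
)

best_score, best_model = -np.inf, None
for seed in range(10):  # loop on random generation of RF fits
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_calib, y_calib)               # fit on the calibration data
    score = model.score(X_pred, y_pred_true)  # accuracy on the prediction data
    if score > best_score:                    # keep the best-fitting model
        best_score, best_model = score, model

print(best_score)
```

Because the model is selected on the prediction data itself, best_score is an optimistic estimate of how that model would do on new data.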

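This Monte Carlo approach is very expensive; I am just wondering whether there is another way to do cross-validation with a random forest (i.e. NOT hyper-parameter optimization).

EDITED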

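【Comments on the question】:

Just read the documentation. In particular this.
When you say your model overfits, are you looking at the oob_score or at the accuracy?
I look at the confusion matrix (False Positive/Negative...). Training is fine. Out-of-sample, however, is not consistent (usually bad, sometimes OK).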
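
The checks mentioned in these comments can be put side by side as follows. This is a hedged sketch on synthetic data, not the asker's setup: the train/test split, n_estimators=100 and the variable names are assumptions. It compares the training accuracy, the out-of-bag score and a confusion matrix computed on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# oob_score=True scores each sample using only trees that did not see it.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on the data used for fitting
oob_acc = clf.oob_score_                 # out-of-bag accuracy estimate
test_cm = confusion_matrix(y_test, clf.predict(X_test))  # rows: true, cols: predicted

print(train_acc, oob_acc)
print(test_cm)
```

A training accuracy far above the out-of-bag score is expected for fully grown trees, so the out-of-bag score and the held-out confusion matrix are the more informative numbers here.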

Tags: scikit-learn

【Solution 1】:

Cross-validation with any classifier in scikit-learn is really simple:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier()  # initialize with whatever parameters you want

# 10-fold cross-validation; X_train and y_train are your training features and labels
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
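
Since the question is about overfitting, it can also help to look at training and validation scores per fold. The sketch below uses cross_validate with return_train_score=True. The synthetic data and n_estimators=50 are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for X_train / y_train (assumption).
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 10-fold cross-validation that also records the score on each training fold.
cv_results = cross_validate(clf, X_train, y_train, cv=10, return_train_score=True)

mean_train = np.mean(cv_results["train_score"])
mean_test = np.mean(cv_results["test_score"])
print(f"train: {mean_train:.3f}, validation: {mean_test:.3f}")
```

A large gap between the two means points to overfitting, without running any hyper-parameter search.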

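If you wish to run a grid search, you can do so easily with the GridSearchCV class. For that you have to provide a param_grid which, according to the documentation, is a

Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

So you could, for example, define your param_grid as follows: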

param_grid = {
                 'n_estimators': [5, 10, 15, 20],
                 'max_depth': [2, 5, 7, 9]
             }

Then you can use the GridSearchCV class as follows:

from sklearn.model_selection import GridSearchCV

grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)

You can then get the best model with grid_clf.best_estimator_ and the best parameters with grid_clf.best_params_. Similarly, you can get the grid scores with grid_clf.cv_results_.
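
Put together, the grid search steps above might look like this end to end. It is a sketch under assumptions: the data is synthetic and random_state=0 is added only for reproducibility. The parameter grid and cv=10 are the ones from this answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [5, 10, 15, 20],
    "max_depth": [2, 5, 7, 9],
}

grid_clf = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=10)
grid_clf.fit(X_train, y_train)

best_model = grid_clf.best_estimator_    # refitted on all of X_train
best_params = grid_clf.best_params_      # parameter setting with the best mean CV score
best_cv_score = grid_clf.best_score_     # that mean CV score
mean_scores = grid_clf.cv_results_["mean_test_score"]  # one entry per setting

print(best_params, best_cv_score)
```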

Hope this helps!

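【Comments】:

The question is not about hyper-parameter optimization.
It should be from sklearn.ensemble import RandomForestClassifier or from sklearn.ensemble import RandomForestRegressor
Great explanation! For some reason the best "score" (score={"AUC":"roc_auc","ACC":make_score(accuracy)}) is not in grid_clf.best_estimator_. Is there a reason for that?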
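
On the last comment: scores are stored on the search object (best_score_, cv_results_), not on best_estimator_, which is only the refitted model. With several metrics, refit has to name the metric used to pick the best setting. The sketch below assumes the comment meant make_scorer(accuracy_score). The data and the small grid are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# Two metrics; the dictionary keys become suffixes in cv_results_.
scoring = {"AUC": "roc_auc", "ACC": make_scorer(accuracy_score)}

grid_clf = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [2, 5]},
    scoring=scoring,
    refit="AUC",  # metric used to choose and refit the best setting
    cv=5,
)
grid_clf.fit(X_train, y_train)

best_auc = grid_clf.best_score_  # mean CV AUC of the best setting
acc_of_best = grid_clf.cv_results_["mean_test_ACC"][grid_clf.best_index_]

print(best_auc, acc_of_best)
```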