Feature_importances in scikit-learn: how to choose the correct parameters? (Answers)

【Question title】: Feature_importances in scikit-learn: how to choose the correct parameters?
【Posted】: 2017-01-05 09:28:20
【Question】:

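My task is to find out which features (the columns of the X dataset) are best at predicting the target variable y. I decided to use feature_importances_ from RandomForestClassifier. RandomForestClassifier gets its best score (ROC AUC) with max_depth=10 and n_estimators=50. Is it correct to take feature_importances_ from the model with the best parameters, or from one with the default parameters? Why? And how does feature_importances_ work?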

For example, here are the models with the best and with the default parameters.

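import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Model with the best parameters found (X is a DataFrame of features, y is the target)
model = RandomForestClassifier(max_depth=10, n_estimators=50)
model.fit(X, y)
feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["importance"])

# Model with the default parameters
model = RandomForestClassifier()
model.fit(X, y)
feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["importance"])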

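【Comments】:

You don't "use" feature importances as such. They are an estimate of how informative each feature is for your predictions.
As @cel said, feature_importances_ simply scores the importance of each of your columns. That's all. Also, if you just google the scikit-learn documentation, you will find an example there that shows how to read feature_importances_.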
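
To make the point about reading feature_importances_ concrete, here is a minimal sketch. It does not use the asker's data: the synthetic dataset from make_classification, the column names f0 to f7 and the random_state values are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the asker's X and y (assumption: 8 columns, 3 informative).
X_arr, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

# Same hyperparameters as the tuned model in the question.
model = RandomForestClassifier(max_depth=10, n_estimators=50, random_state=0)
model.fit(X, y)

# One non-negative score per column; the scores are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

The ranking is relative to this particular fitted model, so a different random_state or different hyperparameters can reorder features whose scores are close.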

Tags: scikit-learn random-forest feature-selection

【Answer 1】:

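I think you should take feature_importances_ from the model with the best parameters, because that is the model you are actually going to use. There is nothing special about the default parameters that would justify treating them differently. As for how feature_importances_ works, see the answer from the scikit-learn authors to the question "How are feature_importances in RandomForestClassifier determined?"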
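
As a rough illustration of how the forest-level scores come about, the sketch below fits the tuned model and rebuilds feature_importances_ from the individual trees. In the scikit-learn versions I am aware of, the forest's value is the average of each tree's impurity-based feature_importances_, renormalized to sum to 1. The synthetic data and random_state values are assumptions, not the asker's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the asker's data (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# The tuned model from the question: importances should describe the model you deploy.
tuned = RandomForestClassifier(max_depth=10, n_estimators=50, random_state=0)
tuned.fit(X, y)

# Each fitted tree exposes its own impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in tuned.estimators_])

# Average over trees, then renormalize so the scores sum to 1.
# (This assumes every tree made at least one split, which holds for data like this.)
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

# Spread across trees gives a feel for how stable each score is.
spread = per_tree.std(axis=0)

print(np.allclose(averaged, tuned.feature_importances_))
print(spread)
```

Because the scores are derived from the splits of the fitted trees, changing max_depth or n_estimators changes the trees and therefore the importances, which is why they should come from the model you intend to keep.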

【Comments】: