Feature selection embedded method showing wrong features: answers

【Question title】: features selection embedded method showing wrong features
【Posted】: 2021-01-31 18:43:43
【Question】:

When doing feature selection (embedded method), I am getting the wrong features.

Feature selection code:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# create the random forest model
model = RandomForestRegressor(n_estimators=120)

# fit the model; the target column delay_in_days is stored in X_train
model.fit(X_train[_columns], X_train['delay_in_days'])

# get the impurity-based importance of each feature
importances = model.feature_importances_

# create a data frame for visualization, sorted in descending order
final_df = pd.DataFrame({"Features": X_train[_columns].columns, "Importances": importances})
final_df = final_df.sort_values('Importances', ascending=False)

# visualise the 10 most important features
pd.Series(importances, index=X_train[_columns].columns).nlargest(10).plot(kind='barh')

_columns  # some of my selected features


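In the feature ranking above, you can see that total_open_amount is a very important feature, but when I put the top 3 features into my model I get a negative R2 score. If I remove total_open_amount from the model, I get a decent R2 score.

My question is: what causes this? (All training and test data were randomly sampled from a dataset of size 100000.)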

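from sklearn.metrics import r2_score

# x_train / x_test hold the top 3 features from the ranking above
clf = RandomForestRegressor()
clf.fit(x_train, y_train)

# predict on the test set, then compute the R2 score discussed above
# (y_test is the assumed name of the test targets)
predicted = clf.predict(x_test)
print(r2_score(y_test, predicted))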

【Comments】:

Tags: python machine-learning random-forest

【Solution 1】:

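This is an educated guess, since you did not provide the data itself. Looking at your feature names, the top features are name customer and total open amount. I suspect these are features with a lot of unique values.

If you check the help page for random forest, it does mention:

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance as an alternative.

This is also mentioned in a publication by Strobl et al.:

We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.

I would try permutation importance and see whether you get the same results.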
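To make the warning concrete, here is a minimal sketch on synthetic data (not your dataset; the column names signal, noise_high_card and noise_binary are made up for illustration). Both noise columns are unrelated to the target, yet the one with many unique values typically ends up with a much larger impurity-based importance, simply because it offers far more candidate split points.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: one real signal and two pure-noise columns.
signal = rng.normal(size=n)                # genuinely predictive
noise_high_card = rng.normal(size=n)       # noise, roughly n unique values
noise_binary = rng.integers(0, 2, size=n)  # noise, only 2 unique values

X = np.column_stack([signal, noise_high_card, noise_binary])
y = signal + rng.normal(scale=1.0, size=n)  # target depends on signal only

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, keyed by the illustrative feature names.
names = ["signal", "noise_high_card", "noise_binary"]
impurity_importance = dict(zip(names, model.feature_importances_))
for name, value in impurity_importance.items():
    print(f"{name:>16}: {value:.3f}")
```

If your name customer and total open amount columns behave like noise_high_card here, their high rank in the plot would not by itself mean they help prediction on unseen data.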
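A rough sketch of what that could look like, again on synthetic stand-in data since I don't have yours. The noise_id column is a hypothetical ID-like feature with one unique value per row, and the split sizes and n_repeats are arbitrary choices. The permutation importance is computed on a held-out split with sklearn.inspection.permutation_importance, so a feature only scores well if shuffling it actually hurts the test R2.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with your own features/target).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "signal": rng.normal(size=n),                   # genuinely predictive
    "noise_id": rng.permutation(n).astype(float),   # unique per row, no signal
})
y = 2.0 * X["signal"] + rng.normal(scale=1.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set and measure the drop in R2.
result = permutation_importance(
    model, X_te, y_te, scoring="r2", n_repeats=10, random_state=0
)

# Side-by-side view of both importance measures.
comparison = pd.DataFrame(
    {"impurity": model.feature_importances_, "permutation": result.importances_mean},
    index=X.columns,
).sort_values("permutation", ascending=False)
print(comparison)
```

On your data you would pass your own fitted model together with the test features and test target. A feature that ranks high in the impurity column but sits near zero in the permutation column is a good candidate to drop.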

【Comments】: