Get label-specific feature importances from Random Forest (and Decision Tree): answers

【Question title】: Get label specific feature importances from Random Forest (and Decision Tree)
【Posted】: 2021-11-05 01:18:01
【Question】:

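I want to retrieve label/class-specific feature importances from a Random Forest or a Decision Tree without training n_class one-vs-rest models. I am using scikit-learn in Python. The models are instances of tree.DecisionTreeClassifier() or RandomForestClassifier(). Since the feature_importances_ attribute only returns the importance of each feature for the model as a whole, it unfortunately doesn't help me much.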

【Comments】:

Tags: python machine-learning scikit-learn random-forest decision-tree

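【Solution 1】:

To label the importances, you can create a pandas.Series and set its index to the column names of the training data. The feature importances returned by RandomForestClassifier keep the same order as the columns of the training data.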

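from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# X is the DataFrame of training features, y holds the target labels
rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X, y)
# Use X.columns as the index so each importance is labeled with its feature name
importances = pd.Series(rfc.feature_importances_, index=X.columns)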

print(importances)

Pclass      0.083675
Sex         0.190060
Age         0.234741
SibSp       0.051893
Parch       0.034452
Fare        0.254560
Embarked    0.031173
titles      0.119446
dtype: float64

print(X)
     Pclass  Sex        Age  SibSp  Parch     Fare  Embarked  titles
0         3    0  22.000000      1      0   7.2500         0      12
1         1    1  38.000000      1      0  71.2833         1      13
2         3    1  26.000000      0      0   7.9250         0       9
3         1    1  35.000000      1      0  53.1000         0      13
4         3    0  35.000000      0      0   8.0500         0      12
...     ...  ...        ...    ...    ...      ...       ...     ...
886       2    0  27.000000      0      0  13.0000         0      15
887       1    1  19.000000      0      0  30.0000         0       9
888       3    1  29.699118      1      2  23.4500         0       9
889       1    0  26.000000      0      0  30.0000         1      12
890       3    0  32.000000      0      0   7.7500         2      12

print(X.columns)
>>> Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'titles'], dtype='object')

For details, see the scikit-learn example "Feature importances with a forest of trees".

【Comments】:

【Solution 2】:

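The built-in "importance" should be used with caution! Importance can be computed in many different ways: how often is a variable used for a split? What is the average impurity after splitting on each variable? And so on. It is therefore really important that you know exactly how the importance is calculated, and that you agree it actually corresponds to your own understanding of "importance".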
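To illustrate how much the definition matters, here is a minimal sketch comparing two importance measures that scikit-learn provides for the same fitted forest: the impurity-based feature_importances_ attribute and sklearn.inspection.permutation_importance. The iris dataset, the train/test split and the hyperparameters are assumptions chosen only for illustration; they are not from the question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative data only (assumption): the iris dataset, 3 classes, 4 features
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)

# Impurity-based importance: normalized mean decrease in impurity,
# derived from the training data while the trees were built
mdi = rfc.feature_importances_

# Permutation importance: drop in held-out accuracy when one feature is shuffled
perm = permutation_importance(rfc, X_test, y_test, n_repeats=10, random_state=0)

for name, a, b in zip(data.feature_names, mdi, perm.importances_mean):
    print(f"{name:20s} MDI={a:.3f}  permutation={b:.3f}")
```

The two columns are on different scales (the impurity-based values sum to 1, the permutation values are accuracy drops), so compare the rankings rather than the raw numbers.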

I suggest looking at shap to compute SHAP values, which give a more reliable and "correct" answer about importance.

【Comments】: