Feature importance for the randomForest package

【Question title】: Feature importance for the randomForest package
【Posted】: 2020-05-12 05:17:00
【Question description】:

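I want to use a random forest to find the most important features for a classification problem (I have two classes: 0 and 1).

I created the model: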

rf = randomForest(y  ~ ., data = df, sampsize=100000,ntree=100, importance = TRUE, keep.forest = FALSE)

Then I used the following to check the importance:

importance(rf, type = 1, class = 1)

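I read that the class argument can be used for classification problems. My question is whether I have to sort the results by the absolute value of the mean decrease in accuracy. When I use varImpPlot, it seems I should take the negative values into account as well. And what exactly does the argument class = 1 do?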

【Comments】:

Hi Sara, your code looks correct, provided your data is prepared correctly. Lemme check class again.. So whether to sort depends on what you want to do with the results?

Tags: r random-forest feature-selection

【Solution 1】:

We can use the iris dataset, which has 3 species:

data(iris)
table(iris$Species)

setosa versicolor  virginica 
    50         50         50

We fit a random forest:

library(randomForest)
mdl = randomForest(Species~.,data=iris,importance=TRUE)
# let's do it without options
importance(mdl)
                setosa versicolor virginica MeanDecreaseAccuracy
Sepal.Length  6.364533  6.2112640  7.632076            10.365371
Sepal.Width   4.790211  0.4339124  5.500338             5.153676
Petal.Length 22.027701 34.5777755 29.080648            35.215194
Petal.Width  22.500729 31.1403378 30.714576            33.335003
             MeanDecreaseGini
Sepal.Length         9.223319
Sepal.Width          2.189763
Petal.Length        44.703684
Petal.Width         43.163546

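The table above contains all of your results. If you call importance(mdl, type = 1), you get the mean decrease in accuracy for each variable over all classes. The table also has three separate columns, one for each class that can be predicted (setosa, versicolor, virginica), so if you do: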

importance(mdl,type=1,class="setosa")
                setosa
Sepal.Length  6.364533
Sepal.Width   4.790211
Petal.Length 22.027701
Petal.Width  22.500729

you get the change in accuracy related to this class only.
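As a quick check of the relationship described above, the sketch below compares the subsetted results with the columns of the full table. It assumes the randomForest package is installed; the seed and the names full, overall and setosa_only are illustrative and not part of the original answer.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so repeated runs give the same forest
mdl <- randomForest(Species ~ ., data = iris, importance = TRUE)

full        <- importance(mdl)                              # all columns
overall     <- importance(mdl, type = 1)                    # MeanDecreaseAccuracy only
setosa_only <- importance(mdl, type = 1, class = "setosa")  # setosa column only

# Each subset should match the corresponding column of the full table
all.equal(overall[, 1], full[, "MeanDecreaseAccuracy"])
all.equal(setosa_only[, 1], full[, "setosa"])
```

Both comparisons should come back TRUE, which suggests that type and class only select columns of the full table rather than recomputing anything.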

So in your code, when you call importance(rf, type = 1, class = 1) and your model is randomForest(y ~ ., data = df... ), you are asking for the importance of each variable with respect to predicting the label 1 in y.
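To make the two-class case concrete, here is a minimal sketch. The asker's df is not available, so a hypothetical stand-in is built from iris with a factor response labelled "0" and "1"; the seed and the name imp_class1 are likewise illustrative.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical stand-in for the asker's data: two classes labelled "0" and "1"
df <- iris[, 1:4]
df$y <- factor(as.integer(iris$Species == "virginica"))

rf <- randomForest(y ~ ., data = df, ntree = 100,
                   importance = TRUE, keep.forest = FALSE)

# One column per class label, followed by the two overall measures
colnames(importance(rf))

# Permutation importance with respect to predicting the class labelled "1"
imp_class1 <- importance(rf, type = 1, class = "1")
imp_class1
```

This assumes y is a factor. With a numeric 0/1 response, randomForest fits a regression forest instead, and there are no per-class columns to select.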

Finally, you can sort them:

res = importance(mdl,type=1,class="setosa")
res = res[order(res[,1],decreasing=TRUE),,drop=FALSE]
res
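For completeness, a self-contained sketch of the sorting step, together with the plotting helper the question mentions. The seed and the name res_sorted are illustrative. The sort here is on the signed value, so any negative entries would end up at the bottom rather than being ranked by magnitude.

```r
library(randomForest)

set.seed(42)  # arbitrary seed for reproducibility
mdl <- randomForest(Species ~ ., data = iris, importance = TRUE)

res <- importance(mdl, type = 1, class = "setosa")

# Keep the one-column matrix shape (drop = FALSE) so the row names survive
res_sorted <- res[order(res[, 1], decreasing = TRUE), , drop = FALSE]
res_sorted

# Dot chart of the same class-specific measure; varImpPlot sorts by default
varImpPlot(mdl, type = 1, class = "setosa")
```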

【Discussion】: