Random Forest Variable Selection: Answers

【Question title】: Random Forest Variable Selection
【Posted】: 2017-08-29 17:08:44
【Question】:

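I have a random forest that is currently built on 100 different variables. I would like to select only the "most important" variables and build the forest on those to try to improve performance, but apart from reading the importance from rf$importance I don't know where to start.

My data consists only of numeric variables, all of which have been scaled.

Below is my RF code: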

library(randomForest)
library(Hmisc)  # provides rcorr.cens()

rf.2 = randomForest(x~., data=train,importance=TRUE, ntree=1501)

#train
rf_prob_train = data.frame(predict(rf.2, newdata=train, type="prob"))
rf_prob_train <-data.frame(rf_prob_train$X0)
val_rf_train<-cbind(rf_prob_train,train$x)
names(val_rf_train)<-c("Probs","x")

##Run accuracy ratio
x<-data.frame(rcorr.cens(-val_rf_train$Probs, val_rf_train$x))
rf_train_AR<-x[2,1]
rf_train_AR

#test
rf_prob_test = data.frame(predict(rf.2, test, type="prob"))
rf_prob_test <-data.frame(rf_prob_test$X0)
val_rf_test<-cbind(rf_prob_test,test$x)
names(val_rf_test)<-c("Probs","x")

##Run accuracy ratio
x<-data.frame(rcorr.cens(-val_rf_test$Probs, val_rf_test$x))
rf_test_AR<-x[2,1]
rf_test_AR

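【Comments】:

Do you know, or have any idea, which variables might be multicollinear? I've found that reducing the number of multicollinear variables helps. Also, are you normalizing your continuous variables? That has given me performance gains too. But yes, pulling them out with $importance is basically how it's done. You could also look at % variance explained, but the two say more or less the same thing.
Thank you. I don't really know for sure which ones are, but I could make an educated guess. Once I've pulled them out with $importance, do you know what to do next so that only the more important ones are included? At the moment I just get a list of my variables and their MeanDecreaseGini.
You simply have to decide for yourself which to keep and which to reject. When you look at MeanDecreaseGini, does it look asymptotic? You could just grab everything above the inflection point and leave the rest. If you need help subsetting based on something like variance explained, reply to this comment and I'll write it up as an answer.
FYI, random forests are very good at avoiding collinearity problems and are self-regularizing. I strongly doubt you will see any performance benefit from removing variables.
Well, all the documentation I've read agrees with you @Vincentmajor, but my personal experience with random forests is that I get a better %VarExplained per variable when I reduce the number of multicollinear variables. That makes sense: if many variables describe more or less the same thing, they split the variance between them when they are all included in the RF model. Depending on what you are using RF for, this may or may not matter. I find myself having to explain why I included a variable in a model far more often than I'd like, so for me, fewer is better.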
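
One way to act on the multicollinearity suggestion in the comments is to filter predictors by pairwise correlation before fitting. The following is only a rough sketch, not something posted in the thread: the helper name `drop_correlated` and the 0.9 cutoff are assumptions, and the result depends on column order because the first variable of each highly correlated pair is the one that is kept.

```r
# Hypothetical helper (not from the thread): greedily keep a predictor only if
# its absolute correlation with every already-kept predictor is <= cutoff.
drop_correlated <- function(x, cutoff = 0.9) {
  cm <- abs(cor(x))          # absolute pairwise correlation matrix
  keep <- character(0)
  for (v in colnames(cm)) {
    # all() on an empty vector is TRUE, so the first column is always kept
    if (all(cm[v, keep] <= cutoff)) keep <- c(keep, v)
  }
  keep
}

# Usage on the numeric columns of iris (the 0.9 cutoff is an arbitrary choice)
kept <- drop_correlated(iris[, 1:4], cutoff = 0.9)
print(kept)
```

The returned names can then be used to subset the training data before calling randomForest.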

Tags: r machine-learning random-forest

【Solution 1】:

Busy day today, so I couldn't get this to you sooner. This gives you the general idea using a generic dataset.

library(randomForest)
library(datasets)

head(iris)
#To make our formula for RF easier to manipulate

var.predict<-paste(names(iris)[-5],collapse="+")
rf.form <- as.formula(paste(names(iris)[5], var.predict, sep = " ~ "))

print(rf.form)
#This is our current iteration of the formula we're using in RF

iris.rf<-randomForest(rf.form,data=iris,importance=TRUE,ntree=100)

varImpPlot(iris.rf)
#Examine our Variable importance plot

to.remove<-c(which(data.frame(iris.rf$importance)$MeanDecreaseAccuracy==min(data.frame(iris.rf$importance)$MeanDecreaseAccuracy)))
#Remove the variable with the lowest decrease in Accuracy (Least relevant variable)

#Rinse, wash hands, repeat

var.predict<-paste(names(iris)[-c(5,to.remove)],collapse="+")
rf.form <- as.formula(paste(names(iris)[5], var.predict, sep = " ~ "))

iris.rf<-randomForest(rf.form,data=iris,importance=TRUE,ntree=100)

varImpPlot(iris.rf)
#Examine our Variable importance plot

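imp<-iris.rf$importance
to.remove<-c(to.remove, match(rownames(imp)[which.min(imp[,"MeanDecreaseAccuracy"])], names(iris)))
#Look the variable up by name: row positions in the reduced importance table
#no longer line up with the column positions in names(iris)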

#And so on...
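
The "and so on" step can be wrapped in a loop. This is a minimal sketch rather than part of the original answer: the function name `backward_eliminate`, the `n_keep` argument, and the stopping rule (stop once a fixed number of predictors is left) are all assumptions made for illustration. It assumes a classification forest, where the importance table has a MeanDecreaseAccuracy column, and it tracks the remaining predictors by name so that positions cannot drift out of sync with the data frame.

```r
library(randomForest)

# Hypothetical helper (not from the original answer): repeatedly refit the
# forest and drop the predictor with the lowest MeanDecreaseAccuracy.
backward_eliminate <- function(data, response, n_keep = 2, ntree = 100) {
  predictors <- setdiff(names(data), response)
  while (length(predictors) > n_keep) {
    # reformulate() builds response ~ p1 + p2 + ... from character vectors
    rf.form <- reformulate(predictors, response = response)
    fit <- randomForest(rf.form, data = data, importance = TRUE, ntree = ntree)
    imp <- fit$importance[, "MeanDecreaseAccuracy"]
    # Remove the least important predictor by name
    predictors <- setdiff(predictors, names(which.min(imp)))
  }
  predictors
}

set.seed(42)  # arbitrary seed, only so repeated runs give the same result
kept <- backward_eliminate(iris, "Species", n_keep = 2)
print(kept)
```

Stopping at a fixed count is the simplest rule; tracking out-of-bag error at each step and stopping when it starts to rise would be another option.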

【Comments】:

It would be nice to have a function-based solution that runs on the data and returns only the five most important variables.
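
A function along those lines could look roughly like the sketch below. It was not posted in the thread: the name `top_n_variables` and its arguments are hypothetical, and it assumes a classification forest ranked by MeanDecreaseAccuracy from a single fit. Since iris has only four predictors, the usage example asks for three rather than five.

```r
library(randomForest)

# Hypothetical helper (not from the thread): fit one forest on all predictors
# and return the names of the n with the highest MeanDecreaseAccuracy.
top_n_variables <- function(data, response, n = 5, ntree = 100) {
  rf.form <- reformulate(".", response = response)   # response ~ .
  fit <- randomForest(rf.form, data = data, importance = TRUE, ntree = ntree)
  imp <- fit$importance[, "MeanDecreaseAccuracy"]
  # head() returns fewer than n names if there are fewer than n predictors
  head(names(sort(imp, decreasing = TRUE)), n)
}

set.seed(42)  # arbitrary seed, only so repeated runs give the same result
top3 <- top_n_variables(iris, "Species", n = 3)
print(top3)
```

The returned names can be passed to reformulate() again to refit the forest on just those variables.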