Stratified sampling with Random Forests in R: answers

【Question title】: Stratified sampling with Random Forests in R
【Posted】: 2013-01-28 07:41:49
【Question】:

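I read the following in the documentation of randomForest:

strata: A (factor) variable that is used for stratified sampling.

sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

For reference, the interface of the function is given by: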

 randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
              mtry=if (!is.null(y) && !is.factor(y))
              max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
              replace=TRUE, classwt=NULL, cutoff, strata,
              sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
              nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
              maxnodes = NULL,
              importance=FALSE, localImp=FALSE, nPerm=1,
              proximity, oob.prox=proximity,
              norm.votes=TRUE, do.trace=FALSE,
              keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
              keep.inbag=FALSE, ...)

My question is: how exactly does one use strata and sampsize? Here is a minimal working example in which I would like to test these parameters:

library(randomForest)
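# stringsAsFactors = TRUE keeps iris.type a factor (R >= 4.0 defaults to FALSE), so a classifier is fitted
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", sep = ",", header = FALSE, stringsAsFactors = TRUE)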
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width", "iris.type")

model = randomForest(iris.type ~ sepal.length + sepal.width, data = iris)

> model
500 samples
  6 predictors
  2 classes: 'Y0', 'Y1' 

No pre-processing
Resampling: Bootstrap (7 reps) 

Summary of sample sizes: 477, 477, 477, 477, 477, 477, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens  Spec  ROC SD  Sens SD  Spec SD
  2     0.763  1     0     0.156   0        0      
  4     0.782  1     0     0.231   0        0      
  6     0.847  1     0     0.173   0        0      

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 6.

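I am asking about these parameters because I want the RF to use bootstrap samples that respect the proportion of positives to negatives in my data.

This other thread started a discussion on the topic, but it was settled without clarifying how to use these parameters.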

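【Comments】:

Is the example code in ?randomForest that demonstrates stratified sampling not clear enough for you?
Thanks @joran. The example provided in the documentation uses sampsize but not strata. The documentation only says: strata: A (factor) variable that is used for stratified sampling. In this context, the word "used" is not clear to me. Maybe that is because I am fairly new to both stratified sampling and R.
If you do not supply strata, it probably defaults to the response variable. If you want strata that differ from the response variable, you can supply them yourself.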
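
A quick way to check that last comment, sketched under a few assumptions: it uses R's built-in iris data (columns Species, Sepal.Length, Sepal.Width) instead of the downloaded file, leaves out strata entirely, and samples without replacement so that the keep.inbag matrix holds plain 0/1 membership flags. If the response really is the default strata, every tree should draw exactly 10 rows per species.

```r
library(randomForest)

set.seed(42)

# No strata argument: only a per-class sampsize vector is supplied.
rf <- randomForest(Species ~ Sepal.Length + Sepal.Width,
                   data = iris,
                   sampsize = c(10, 10, 10),
                   replace = FALSE,     # each row is in-bag at most once per tree
                   ntree = 25,
                   keep.inbag = TRUE)   # record which rows each tree used

# Sum the in-bag flags within each species:
# one row per class, one column per tree.
draws_per_class <- rowsum(rf$inbag, iris$Species)
print(draws_per_class[, 1:5])
```

If any entry of draws_per_class differed from 10, the sampling would not be stratified by the response.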

Tags: r

【Answer 1】:

Wouldn't this just be something along the lines of:

model = randomForest(iris.type ~ sepal.length + sepal.width, 
                     data = iris, 
                     sampsize=c(10,10,10), strata=iris$iris.type)

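I did try ..., strata=iris.type and ..., strata='iris.type', but apparently the code is not written to interpret that value in the environment of the "data" argument. I used the outcome variable because it is the only factor variable in this dataset, but I do not think it needs to be the outcome variable. In fact, I think it definitely should not be the outcome variable. This particular model would be expected to produce useless output and is only presented to test the syntax.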
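
To illustrate the point that strata need not be the outcome, here is a minimal sketch. The grouping factor petal_group is made up for illustration (it is not part of the question), and R's built-in iris data stands in for the downloaded file. strata is passed as a vector rather than as a column name, in line with the observation above, and sampling is done without replacement so the in-bag matrix contains simple 0/1 flags.

```r
library(randomForest)

set.seed(1)

# Hypothetical stratification factor that is NOT the response.
petal_group <- factor(ifelse(iris$Petal.Length > 4, "long", "short"))

rf <- randomForest(x = iris[, c("Sepal.Length", "Sepal.Width")],
                   y = iris$Species,
                   strata = petal_group,
                   sampsize = c(20, 20),  # one entry per level of petal_group
                   replace = FALSE,
                   ntree = 25,
                   keep.inbag = TRUE)

# Rows drawn per stratum (rows) for each tree (columns).
in_bag <- rowsum(rf$inbag, petal_group)
print(in_bag[, 1:5])
```

Each tree should then use 20 rows from the "long" group and 20 from the "short" group, regardless of how the species are distributed within them.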

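【Comments】:

Thanks! This is exactly what I was looking for.
How are the elements of sampsize associated with each stratum? By level order?
Hi Antonio; the code in randomForest.default converts a non-factor strata argument to a factor and then samples within the levels, so the answer seems to be "yes".
Experiments confirm it. Thanks @BondedDust!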
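
One way such an experiment could look, as a sketch rather than the commenter's actual code: give each stratum a different sampsize and count the in-bag rows per level. It assumes R's built-in iris data, whose Species levels are setosa, versicolor, virginica, and samples without replacement so that the counts are exact.

```r
library(randomForest)

set.seed(7)

levels(iris$Species)   # level order that sampsize is matched against

rf <- randomForest(x = iris[, 1:4],
                   y = iris$Species,
                   strata = iris$Species,
                   sampsize = c(5, 10, 15),  # unequal on purpose
                   replace = FALSE,
                   ntree = 25,
                   keep.inbag = TRUE)

# In-bag rows per level (rows) for each tree (columns).
per_level <- rowsum(rf$inbag, iris$Species)
print(per_level[, 1:5])
```

If sampsize follows the level order, the setosa row contains only 5s, the versicolor row only 10s, and the virginica row only 15s.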