Resample with replacement by cluster: answers

【Question Title】: Resample with replacement by cluster
【Posted】: 2017-05-09 19:46:09
【Question】:

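I want to draw clusters (defined by the variable id) with replacement from a dataset. Unlike in previously answered questions, I want a cluster that is chosen K times to have each of its observations repeated K times. In other words, I am doing a cluster bootstrap.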

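For example, the code below samples id=1 twice, but the observations for id=1 appear only once in the new dataset s. I want all observations from id=1 to appear twice.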

f <- data.frame(id=c(1, 1, 2, 2, 2, 3, 3), X=rnorm(7))
set.seed(451)
new.ids <- sample(unique(f$id), replace=TRUE)
s <- f[f$id %in% new.ids, ]
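
To make the problem concrete, here is a minimal sketch of why the `%in%` subset drops the repeats. It assumes the draw came out as `c(1, 1, 3)`, which matches the ids shown in the answers; the draw is hard-coded here because the result of `sample()` for a given seed can differ between R versions.

```r
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))

# Assumed draw for illustration: cluster 1 twice, cluster 3 once.
new.ids <- c(1, 1, 3)

# %in% only tests membership, so a repeated id still selects its rows once.
s <- f[f$id %in% new.ids, ]
n_rows_s <- nrow(s)  # 4 rows: two for id 1, two for id 3

# A cluster bootstrap needs every draw to contribute its full cluster.
n_rows_wanted <- sum(sapply(new.ids, function(x) sum(f$id == x)))  # 6 rows
```

The gap between `n_rows_s` and `n_rows_wanted` is exactly the second copy of cluster 1 that the question asks for.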

【Comments】:

Tags: r resampling statistics-bootstrap

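【Solution 1】:

One option is to lapply over each element of new.ids, subset the rows for that id, and collect the pieces in a list. You can then stack them into a single table: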

library(data.table)
rbindlist(lapply(new.ids, function(x) f[f$id %in% x,]))
#  id           X
#1:  1  1.20118333
#2:  1 -0.01280538
#3:  1  1.20118333
#4:  1 -0.01280538
#5:  3 -0.07302158
#6:  3 -1.26409125
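
If data.table is not available, the same stacking can probably be done in base R with `do.call(rbind, ...)`. This is a sketch rather than part of the original answer, and it again assumes the draw `c(1, 1, 3)`.

```r
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))
new.ids <- c(1, 1, 3)  # assumed draw, matching the ids in the output above

# One data frame per drawn id; a repeated id yields a repeated piece.
pieces <- lapply(new.ids, function(x) f[f$id == x, ])

# Stack the pieces; rbind makes the duplicated row names unique, so reset them.
s <- do.call(rbind, pieces)
rownames(s) <- NULL
```

The result is a plain data.frame rather than a data.table. For many clusters or many bootstrap replicates, `rbindlist` is typically the faster choice.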

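【Comments】:

【Solution 2】:

In case you need a "new_id" that corresponds to the index of each draw (i.e., its position in the sample): I needed "new_id" so that I could fit a mixed-effects model without several copies of the same cluster being treated as one cluster because they share the same id.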

library(data.table)
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )
set.seed(451); new.ids = sample( unique(f$id), replace=TRUE )
## ss has unique valued `new_id` for each cluster
ss = rbindlist(mapply(function(x, index) cbind(f[f$id %in% x,], new_id=index),
                      new.ids,
                      seq_along(new.ids),
                      SIMPLIFY=FALSE
))
ss

which gives:

> ss
   id          X new_id
1:  1 -0.3491670      1
2:  1  1.3676636      1
3:  1 -0.3491670      2
4:  1  1.3676636      2
5:  3  0.9051575      3
6:  3 -0.5082386      3

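Note that the values of X differ because set.seed was not called before the rnorm() call, but the ids are the same as in @Mike H's answer.

This link was useful in putting this answer together: R lapply statement with index [duplicate]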
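
For repeated use, the two answers can be folded into one function. The sketch below is not from either answer: `cluster_resample` and its `id_col` argument are hypothetical names, and it uses base R only. It also samples positions with `sample.int`, because `sample(x)` treats a single number `x` as `1:x`, which would give wrong ids if only one numeric cluster remained.

```r
# Hypothetical helper: draw one cluster bootstrap sample from `data`.
cluster_resample <- function(data, id_col = "id") {
  ids <- unique(data[[id_col]])
  # Sample positions, not values, to avoid sample()'s single-number behaviour.
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  pieces <- lapply(seq_along(drawn), function(i) {
    rows <- data[data[[id_col]] == drawn[i], , drop = FALSE]
    rows$new_id <- i  # distinct label per draw, even for a repeated cluster
    rows
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

set.seed(451)  # set before rnorm() so that X is reproducible too
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))

# Example use: bootstrap standard error of the mean of X over 200 replicates.
boot_means <- replicate(200, mean(cluster_resample(f)$X))
boot_se <- sd(boot_means)
```

Each replicate keeps whole clusters together, and `new_id` can serve as the grouping variable when a model needs repeated draws of a cluster to count as separate clusters.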

【Comments】: