Simulate unbalanced clustered data: answers

【Question title】: Simulate unbalanced clustered data
【Posted】: 2021-03-26 17:00:22
【Question】:

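I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. However, I would first like to create data in which each cluster has 10% more observations than specified (i.e., 33 rather than 30). I then want to randomly exclude the appropriate number of observations (i.e., 60) to arrive at the specified average number of observations per cluster (i.e., 30). The probability of excluding an observation is not uniform across clusters (i.e., some clusters have no cases removed while others have more excluded). In the end I therefore still have 600 observations in total. Does anyone know how to do this in R? Here is a smaller example dataset. Its number of observations per cluster does not follow the conditions specified above; I only use it to convey the idea.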

> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
   id cluster           x           y
1   1       1  0.30003855  0.65325768
2   2       1 -1.00563626 -0.12270866
3   3       1  0.01925927 -0.41367651
4   4       1 -1.07742065 -2.64314895
5   5       1  0.71270333 -0.09294102
6   1       2  1.08477509  0.43028470
7   2       2 -2.22498770  0.53539884
8   3       2  1.23569346 -0.55527835
9   4       2 -1.24104450  1.77950291
10  5       2  0.45476927  0.28642442
11  1       3  0.65990264  0.12631586
12  2       3 -0.19988983  1.27226678
13  3       3 -0.64511396 -0.71846622
14  4       3  0.16532102 -0.45033862
15  5       3  0.43881870  2.39745248
16  1       4  0.88330282  0.01112919
17  2       4 -2.05233698  1.63356842
18  3       4 -1.63637927 -1.43850664
19  4       4  1.43040234 -0.19051680
20  5       4  1.04662885  0.37842390

After randomly adding and deleting some data, the unbalanced data look like this:

            id cluster      x      y
       1     1       1  0.895 -0.659 
       2     2       1 -0.160 -0.366 
       3     1       2 -0.528 -0.294 
       4     2       2 -0.919  0.362 
       5     3       2 -0.901 -0.467 
       6     1       3  0.275  0.134 
       7     2       3  0.423  0.534 
       8     3       3  0.929 -0.953 
       9     4       3  1.67   0.668 
      10     5       3  0.286  0.0872
      11     1       4 -0.373 -0.109 
      12     2       4  0.289  0.299 
      13     3       4 -1.43  -0.677 
      14     4       4 -0.884  1.70  
      15     5       4  1.12   0.386 
      16     1       5 -0.723  0.247 
      17     2       5  0.463 -2.59  
      18     3       5  0.234  0.893 
      19     4       5 -0.313 -1.96  
      20     5       5  0.848 -0.0613

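Edit: this part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but on a random selection of clusters (this expression can be used: df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]). In the end, I want to get confidence intervals from those 1000 regressions. How should I proceed?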

【Question comments】:

Tags: r simulation data-manipulation data-generation

【Solution 1】:

As requested by Ben Bolker, I am posting my solution, but see jay.sf's answer for a more general approach.

library(dplyr)  # provides %>%, slice_sample() and arrange()

# First create an oversampled dataset:
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each = 6)
df <- data.frame(id = z, cluster = w, x = x, y = y)

# Then slice_sample() down to the desired sample size
df %>%
  slice_sample(n = 20) %>%
  arrange(cluster)

# Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
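
Sampling rows uniformly, as above, removes every row with the same probability. The question also asks for removal probabilities that differ between clusters. A minimal base R sketch of that variant at the full size (20 clusters, 33 rows each, 60 rows dropped) could look like the following. The weight vector `w` and the choice of `rexp()` to generate it are assumptions made for illustration, not part of either answer.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
ncl   <- 20                       # number of clusters
mnobs <- 30                       # target mean observations per cluster
nover <- round(mnobs * 1.1)       # oversampled size per cluster (33)

# Oversampled, balanced data: 20 clusters x 33 rows = 660 rows
df <- data.frame(id      = rep(seq_len(nover), times = ncl),
                 cluster = rep(seq_len(ncl), each = nover),
                 x       = rnorm(ncl * nover),
                 y       = rnorm(ncl * nover))

# Hypothetical cluster-level removal weights: larger weight = more rows dropped
w <- rexp(ncl)

# Draw the rows to drop, with probability proportional to the cluster weight
n_drop <- nrow(df) - ncl * mnobs              # 660 - 600 = 60
drop   <- sample(nrow(df), n_drop, prob = w[df$cluster])
df2    <- df[-drop, ]

# Renumber ids within each cluster and reset the row names
df2$id <- ave(df2$cluster, df2$cluster, FUN = seq_along)
rownames(df2) <- NULL

nrow(df2)            # total number of rows kept
table(df2$cluster)   # per-cluster sizes after removal
```

The total stays at 600 rows because exactly `n_drop` rows are removed; only the split of those removals across clusters is random.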

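【Comments】:

【Solution 2】:

Let ncl be the desired number of clusters. We can generate a sampling space S, which is a sequence spanning a tolerance tol around the mean number of observations per cluster, mnobs. From S we repeatedly draw a random sample of size 1 to obtain a list of clusters CL. If the sum of the cluster lengths equals ncl*mnobs, we break the loop, add random data to the clusters and rbind the result.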

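FUN <- function(ncl=20, mnobs=30, tol=.1) {
  ## sampling space: cluster sizes from mnobs*(1 - tol) to mnobs*(1 + tol)
  S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
  ## redraw the cluster sizes until they sum to ncl*mnobs
  repeat({
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=TRUE)))
    if (sum(lengths(CL)) == ncl*mnobs) break
  })
  ## add ids and random x, y values to each cluster
  L <- lapply(seq.int(CL), function(i) {
    id <- seq.int(CL[[i]])
    cbind(id, cluster=i, 
          matrix(rnorm(max(id)*2),,2, dimnames=list(NULL, c("x", "y"))))
  })
  ## stack the clusters into one data frame
  do.call(rbind.data.frame, L)
}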
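
To make the role of S concrete, the short sketch below rebuilds the sampling space outside the function for the default arguments. The variable name `bounds` is introduced here only for illustration. For mnobs = 30 and tol = 0.1 the sequence should run from 27 to 33, which matches the range of cluster sizes in the usage output.

```r
mnobs <- 30   # mean observations per cluster
tol   <- 0.1  # tolerance around the mean

# End points of the sampling space: mnobs * (1 - tol) and mnobs * (1 + tol)
bounds <- mnobs * (1 + tol * c(-1, 1))

# Same construction as in FUN: a unit-step sequence between the two bounds
S <- do.call(seq.int, as.list(bounds))
S
```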

Usage

set.seed(42)
res <- FUN()  ## using defined `arg` defaults
dim(res)
# [1] 600   4

(res.tab <- table(res$cluster))
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30

table(res.tab)
# 27 28 29 30 31 32 33 
#  2  3  2  4  5  3  1

sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
#      mean        sd 
# 30.000000  1.747178

Displayable example

set.seed(42)
FUN(4, 5, tol=.3)  ## tol needs to be adjusted for smaller samples
#    id cluster           x          y
# 1   1       1  1.51152200 -0.0627141
# 2   2       1 -0.09465904  1.3048697
# 3   3       1  2.01842371  2.2866454
# 4   1       2 -1.38886070 -2.4404669
# 5   2       2 -0.27878877  1.3201133
# 6   3       2 -0.13332134 -0.3066386
# 7   4       2  0.63595040 -1.7813084
# 8   5       2 -0.28425292 -0.1719174
# 9   6       2 -2.65645542  1.2146747
# 10  1       3  1.89519346 -0.6399949
# 11  2       3 -0.43046913  0.4554501
# 12  3       3 -0.25726938  0.7048373
# 13  4       3 -1.76316309  1.0351035
# 14  5       3  0.46009735 -0.6089264
# 15  1       4  0.50495512  0.2059986
# 16  2       4 -1.71700868 -0.3610573
# 17  3       4 -0.78445901  0.7581632
# 18  4       4 -0.85090759 -0.7267048
# 19  5       4 -2.41420765 -1.3682810
# 20  6       4  0.03612261  0.4328180

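【Comments】:

Thanks @jay.sf for the detailed answer, as always! I actually found a "seemingly" shortcut answer: just oversample first, then slice_sample to arrive at the desired sample size.
@cliu Sure, we can find special packages with functions for almost any specific problem. Learning how to do things from scratch may be more sustainable, though, and prevents addiction :)
Hi @jay.sf. Sorry to ask for your help again, but I just edited my question with an updated task: running regressions on the generated clustered data. Could you share some code for implementing it? Based on your answer to another question I wrote the following, but the function seems to be fixed and its result does not change when I later replicate it: BetaClus <- function() { clsamp.reg <- df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ] x <- unlist(clsamp.reg["x"]) y <- unlist(clsamp.reg["y"]) clusmod <- lm(y ~ x) confint(clusmod, "x", level = 0.95)}. Thanks!
@cliu, could you post your solution as an answer?
@BenBolker Sure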
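
The follow-up about repeating the simulation is not answered in the thread. One possible reading is sketched below, under clearly stated assumptions. `gen_data()` is a hypothetical stand-in generator that oversamples and drops rows uniformly, in the spirit of Solution 1. `beta_clus()` is a hypothetical rewrite of BetaClus that creates fresh data on every call instead of reusing a fixed global df. The "confidence interval" is taken to be a percentile interval over the 1000 slope estimates.

```r
# Hypothetical stand-in generator: 33 rows per cluster, 60 dropped uniformly
gen_data <- function(ncl = 20, mnobs = 30, over = 1.1) {
  nover <- round(mnobs * over)
  cl    <- sort(sample(rep(seq_len(ncl), each = nover), ncl * mnobs))
  data.frame(cluster = cl,
             x = rnorm(ncl * mnobs),
             y = rnorm(ncl * mnobs))
}

# One replicate: new data, resample whole clusters, fit y ~ x, return the slope
beta_clus <- function(ncl = 20, mnobs = 30) {
  df   <- gen_data(ncl, mnobs)                   # fresh data on every call
  idx  <- split(seq_len(nrow(df)), df$cluster)   # row indices per cluster
  k    <- length(idx)
  pick <- sample.int(k, k, replace = TRUE)       # clusters drawn with replacement
  boot <- df[unlist(idx[pick], use.names = FALSE), ]
  coef(lm(y ~ x, data = boot))[["x"]]
}

set.seed(123)                                    # arbitrary seed
slopes <- replicate(1000, beta_clus())

# Percentile interval over the 1000 slope estimates
quantile(slopes, c(0.025, 0.975))
```

Because x and y are generated independently here, the slopes should scatter around zero. If a model-based interval per replicate is wanted instead, the last line of `beta_clus()` could return `confint(fit, "x")` for a fitted object `fit`, which gives two values per replicate.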