Random sampling of two data tables with a condition: answers

【Question title】: Random sampling of two data tables with condition
【Posted】: 2019-02-24 00:32:34
【Question description】:

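I am trying to sample from two data tables under a condition, combine the columns of the two resulting samples, then repeat these steps and append each result to a new data table. Excerpts of the two tables (they do not have the same length):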

data1
   month1 year
1: 1    2014
2: 2    2015
3: 3    2016
..

data2
   month2    
1: 4   
2: 5    
3: 6   
..

First sample: s1 = sample(data1[month = i ], 100, replace=TRUE), where i runs from 1 to n

Second sample: s2 = sample(data2[month > i ], 100, replace=TRUE), where the month must be greater than the month chosen for s1.

The two samples should then be combined into a new data table, e.g. dt1 = cbind(s1,s2)

I want to repeat these steps for every month and build a new dataset that contains all of the resulting samples (pseudocode):

for(i in 1:10){
  s1_i = sample(data1[month = i ], 100, replace=TRUE)
  s2_i = sample(data2[month > i ], 100, replace=TRUE)
  new_i = cbind(s1_i,s2_i)
}
allsamples = rbind(new_1,new_2,new_3,...)

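I am having trouble writing this loop. It should not create a separate dataset at every step; it should only create the allsamples dataset, with all samples combined in it.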

【Question discussion】:

Tags: r datatable sample

【Solution 1】:

How about this?

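allsamples <- NULL
# column names month1/month2 follow the table excerpts in the question
for (i in sort(unique(data1$month1))) {
  sub1 <- data1[month1 == i]   # rows of data1 for month i
  sub2 <- data2[month2 > i]    # rows of data2 with a later month
  if (nrow(sub2) == 0) next    # no later month available, skip this month
  # sample() on a whole table draws indices from 1..ncol, so sample row indices instead
  s1 <- sub1[sample(.N, 100, replace = TRUE)]
  s2 <- sub2[sample(.N, 100, replace = TRUE)]
  allsamples <- rbind(allsamples, cbind(s1, s2))
}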

As set up, you are sampling with replacement. Is that what you intended?
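To make the question about replacement concrete: with replace = TRUE the same row can be drawn several times, so 100 draws work even when a month has only a few rows, while replace = FALSE fails as soon as more rows are requested than exist. A minimal sketch, assuming the data.table package is installed; the table `toy` is made up for illustration and is not from the question.

```r
library(data.table)

# Hypothetical toy table (not from the question): a single month with only 3 rows.
toy <- data.table(month1 = c(1, 1, 1), year = c(2014, 2015, 2016))

set.seed(1)  # only to make the draws reproducible

# With replacement: 100 row indices are drawn from 3 rows, so rows repeat.
with_repl <- toy[sample(nrow(toy), 100, replace = TRUE)]

# Without replacement: asking for more rows than exist raises an error,
# which is caught here and turned into a message string.
without_repl <- tryCatch(
  toy[sample(nrow(toy), 100, replace = FALSE)],
  error = function(e) "sample larger than population"
)
```

If duplicates are not wanted, the sample size would have to be capped at the number of rows in each subset.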

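There may be better ways to do this, since growing an object inside a loop is generally slow, but given that the loop runs over only 12 months, I don't think it will hurt your performance much.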
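One way to avoid growing the object, sketched here under assumptions: collect one sampled table per month in a list and stack them once at the end with rbindlist() from data.table. The toy data, the variable `n_draws`, and the column names month1/month2 (taken from the table excerpts in the question) are illustrative, not the asker's real data.

```r
library(data.table)

# Hypothetical toy data (not from the question): 50 rows spread over months 1..12.
set.seed(42)
data1 <- data.table(month1 = sample(1:12, 50, replace = TRUE),
                    year   = sample(2014:2016, 50, replace = TRUE))
data2 <- data.table(month2 = sample(1:12, 50, replace = TRUE))

n_draws <- 100  # sample size per month, as in the question

# Build one sampled table per month; return NULL when either subset is empty.
pieces <- lapply(sort(unique(data1$month1)), function(m) {
  sub1 <- data1[month1 == m]
  sub2 <- data2[month2 > m]
  if (nrow(sub1) == 0L || nrow(sub2) == 0L) return(NULL)
  cbind(sub1[sample(nrow(sub1), n_draws, replace = TRUE)],
        sub2[sample(nrow(sub2), n_draws, replace = TRUE)])
})

# rbindlist() skips the NULL entries and stacks the rest in a single step.
allsamples <- rbindlist(pieces)
```

Each row of allsamples then pairs a data1 row with a data2 row from a strictly later month, which can be checked with all(allsamples$month2 > allsamples$month1).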

【Discussion】:

【Solution 2】:

Here is my solution:

  library(data.table)  # provides rbindlist()

  newsample <- list()
  begin_time <- 1
  end_time <- 20
  for (i in begin_time:end_time) {
      datasub1 <- data1[data1$var == i, ]    # filter data1 on the condition
      datasub2 <- data2[data2$var2 > i, ]    # data2 rows with a larger value
      if (nrow(datasub1) == 0 || nrow(datasub2) == 0) next  # nothing to sample from
      s1 <- datasub1[sample(nrow(datasub1), 10, replace = TRUE), ]  # sample rows
      s2 <- datasub2[sample(nrow(datasub2), 10, replace = TRUE), ]
      newsample[[i - (begin_time - 1)]] <- cbind(s1, s2)  # combine and store in list
  }
  allsample <- rbindlist(newsample)  # stack samples as one data table

【Discussion】: