Sample random rows evenly spaced apart in R: answers

【Question title】: Sample random rows evenly spaced apart in R
【Posted】: 2020-10-26 23:00:45
【Question】:

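I have measurements from over 50 years. I am trying to subsample the data to see what patterns I would have detected if I had only sampled in 2 years, or in 3, 4, 5, etc., instead of all 50. I wrote code that pulls 2 random years from the dataset, but I want the condition that these two random years are reasonably spread out in the dataset (at least 10 years apart, or something like that).

Is there a way to do conditional random sampling?

Here is what I am currently doing. Keeping this format would be easiest, since I %>% into other things from here.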

# load the packages that provide %>%, sample_n() and map_dfr()
library(dplyr)
library(purrr)

# build df
df <- data.frame(year = 1:50,
                 response = runif(50, 1, 100))

# set number of times I'll do the simulation
number_simulations <- 5 

# set number of years I'll sample in each simulation
# (I later put this in a for loop so that I could repeat 
#  this process with more and more sample years)
number_samples <- 2



df %>% 
  
  # repeat df x number of times
  replicate(number_simulations, ., simplify = FALSE) %>%  
  
  # pick n random samples from df
  map_dfr(~ sample_n(., number_samples), .id = "simulation")

# Can I change this code to make sure sampled years aren't too close to each other? 
# years 23 and 25 out of 50 won't tell me much. But 23 and 35 would be fine.

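I think the simplest approach would be to write a function, sample_n_conditional(), that I could swap in for sample_n in the map_dfr line. It would have to express something like "sample n years that are at least 10 years apart". Or even something more dynamic that depends on the number of samples, because a 10-year gap becomes unsustainable as I draw more years. So more like "sample n years that are reasonably spread out across the series".

I considered increasing the total number of simulations beyond what I need and then filtering out the ones whose years are too close together, on the assumption that enough of them would qualify. But that's not ideal.

Any ideas are appreciated.

【Question discussion】:

Tags: r random dplyr purrr subsampling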

【Solution 1】:

You can use a repeat loop that only breaks once the gaps between the sampled years are all above a threshold.

n.sim <- 5  ## number of simulations
n.samp <- 2  ## number of samples (also works for n.samp > 2)
thres <- 10  ## threshold

set.seed(42)
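res <- replicate(n.sim, {
  repeat {
    samp <- df[sample(seq_len(nrow(df)), n.samp), ]
    ## sort first so that every pair of neighbouring years is compared;
    ## without sorting, n.samp > 2 could let two close years slip through
    if (all(diff(sort(samp[["year"]])) > thres)) break
  }
  samp
}, simplify = FALSE)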

Result

res
# [[1]]
#    year  response
# 49   49 97.125694
# 37   37  1.726081
# 
# [[2]]
#    year  response
# 1     1 91.565798
# 25   25  9.161318
# 
# [[3]]
#    year response
# 10   10 70.80141
# 36   36 83.45869
# 
# [[4]]
#    year response
# 18   18 12.63125
# 49   49 97.12569
# 
# [[5]]
#    year response
# 47   47 88.88774
# 24   24 94.72016
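
If you want to keep the map_dfr format from the question, the retry loop could be wrapped in a small helper. This is only a sketch: sample_n_spaced(), its max_tries argument and the "half of the even spacing" rule for min_gap are my own assumptions, not part of dplyr or of the question. It also assumes whole-number years and a whole-number gap.

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not a dplyr function): draw n rows whose years are
# all more than min_gap apart, retrying until a draw qualifies.
sample_n_spaced <- function(data, n, min_gap, max_tries = 10000) {
  # fail early if n years cannot fit into the observed range with this gap
  stopifnot(n >= 1, n <= nrow(data),
            (n - 1) * (min_gap + 1) <= diff(range(data$year)))
  for (i in seq_len(max_tries)) {
    samp <- data[sample.int(nrow(data), n), , drop = FALSE]
    # sort first so that every pair of neighbouring years is compared
    if (all(diff(sort(samp$year)) > min_gap)) return(samp)
  }
  stop("no valid sample found within max_tries draws")
}

set.seed(42)
df <- data.frame(year = 1:50, response = runif(50, 1, 100))
number_simulations <- 5
number_samples <- 3

# assumed rule of thumb: require half of the perfectly even spacing,
# so the gap shrinks automatically as more years are sampled
min_gap <- floor(nrow(df) / (2 * number_samples))

sims <- map_dfr(seq_len(number_simulations),
                function(i) sample_n_spaced(df, number_samples, min_gap),
                .id = "simulation")
```

Because min_gap is derived from number_samples, the same call should keep working inside a for loop over increasing sample sizes, as long as the early feasibility check passes.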

Data:

set.seed(42)
df <- data.frame(year=1:50, response=runif(50, 1, 100))
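
The repeat loop retries until a draw qualifies, which can take many attempts when n.samp or thres is large. As an alternative (a different technique from the loop, sketched here under assumptions), the years can be built directly: set aside the mandatory gaps, sample from the remaining slots, then put the gaps back. The name sample_spaced_years() is hypothetical, and the sketch assumes the years are the consecutive integers 1..n_years with one row per year and a whole-number min_gap.

```r
# Hypothetical helper: draw n years from 1..n_years such that all
# neighbouring years are more than min_gap apart, without a retry loop.
sample_spaced_years <- function(n_years, n, min_gap) {
  # slots that remain once the (n - 1) mandatory gaps are set aside
  m <- n_years - (n - 1) * min_gap
  stopifnot(n >= 1, m >= n)
  base <- sort(sample.int(m, n))       # n distinct values, increasing
  base + (seq_len(n) - 1) * min_gap    # re-insert the gaps
}

set.seed(42)
df <- data.frame(year = 1:50, response = runif(50, 1, 100))

yrs  <- sample_spaced_years(nrow(df), n = 4, min_gap = 10)
samp <- df[match(yrs, df$year), ]      # pull the rows for the chosen years
```

Since neighbouring values of base differ by at least 1, neighbouring years differ by at least min_gap + 1. The shift is a one-to-one mapping between ordinary sorted samples and valid spaced samples, so every valid set of years should be equally likely, the same as with the retry loop.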

【Discussion】: