Set rnorm parameters equal to vector: answers

【Question title】: Set rnorm parameters equal to vector
【Posted】: 2018-07-21 19:51:36
【Question】:

I have a data frame with columns for sample size, mean, and standard deviation, plus a target value:

ssize <- c(200, 300, 150)
mean <- c(10, 40, 50)
sd <- c(5, 15, 65)
target <- c(7, 23, 30)
df <- data.frame(ssize, mean, sd, target)

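I want to add another variable, below, that returns the number of elements smaller than target in a sample drawn from a normal distribution with parameters mean and sd and sample size ssize. However, I can't get rnorm to use each row's values as its parameters. For example, running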

df$below <- sum(rnorm(df$ssize, df$mean, df$sd) < df$target)

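produces a sample whose size equals length(df$ssize) rather than the values in df$ssize themselves, because rnorm treats a vector-valued first argument as a request for length(n) draws.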
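The behaviour described above can be checked directly. This is a minimal sketch using the question's parameter values; the names mean_vec and sd_vec are hypothetical and are used only to avoid masking the functions mean() and sd().

```r
# Same parameter values as in the question.
ssize    <- c(200, 300, 150)
mean_vec <- c(10, 40, 50)   # hypothetical name, avoids masking mean()
sd_vec   <- c(5, 15, 65)    # hypothetical name, avoids masking sd()

set.seed(1)  # arbitrary seed, only for reproducibility

# When n has length > 1, rnorm draws length(n) values,
# recycling the mean and sd vectors element by element.
draws   <- rnorm(ssize, mean_vec, sd_vec)
n_draws <- length(draws)   # 3, not 200 + 300 + 150
```

Because only three values are drawn, sum(... < df$target) collapses to a single number, which is then recycled into every row of df$below.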

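Update: a data.table solution for large datasets?

The solutions from @alistaire and @G5W work well, but I want to take the mean of below across 100 rnorm replicates for each row. I tried two approaches: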

df <- df %>% mutate(below = mean(replicate(100, pmap_int(., ~sum(rnorm(..1, ..2, ..3) < ..4)))))

df$below <- with(df, sapply(seq_len(nrow(df)), function(i) mean(replicate(100, sum(rnorm(n[i], mean[i], sd[i]) < target[i])))))

But both take a very long time to run on my dataset, which has >4.3 million rows. Is there a faster data.table (or other) solution?
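One way to avoid drawing the normal samples at all, which is not taken from the answers in this thread, is to swap in a binomial draw. The number of N(mean, sd) draws that fall below target is distributed as Binomial(ssize, p) with p = pnorm(target, mean, sd), and the sum of 100 independent such counts is Binomial(100 * ssize, p). The sketch below assumes the question's column names; p_below, below_mean, and below_expected are hypothetical names introduced for illustration.

```r
set.seed(47)  # arbitrary seed, only for reproducibility

df <- data.frame(ssize  = c(200, 300, 150),
                 mean   = c(10, 40, 50),
                 sd     = c(5, 15, 65),
                 target = c(7, 23, 30))

# Probability that a single draw lands below the target, one value per row.
p_below <- pnorm(df$target, mean = df$mean, sd = df$sd)

# One simulated count per row; same distribution as sum(rnorm(...) < target).
df$below <- rbinom(nrow(df), size = df$ssize, prob = p_below)

# Mean of 100 replicate counts per row: draw the total of the 100
# replicates in one vectorized call, then divide by 100.
df$below_mean <- rbinom(nrow(df), size = 100 * df$ssize, prob = p_below) / 100

# If only the long-run average is needed, no simulation is required.
df$below_expected <- df$ssize * p_below
```

Every call here is vectorized over rows, so the cost no longer grows with the per-row sample size. The simulated values will not match those from rnorm for a given seed; they only follow the same distribution.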

【Comments】:

Tags: r data.table distribution normal-distribution

【Solution 1】:

A list column is a natural way to do this, since it lets you store the samples alongside the parameters that generated them. Iterating with purrr:

library(tidyverse)
set.seed(47)    # for reproducibility

df <- tibble(n = c(200, 300, 150),    # data_frame() is deprecated; n matches rnorm's argument name so pmap works naturally
             mean = c(10, 40, 50), 
             sd = c(5, 15, 65), 
             target = c(7, 23, 30))

df %>% 
    mutate(samples = pmap(.[1:3], rnorm),    # iterate in parallel over parameters and store samples as list column
           below = map2_int(samples, target, ~sum(.x < .y)))    # iterate over samples and target, calculate number below, and simplify to integer vector
#> # A tibble: 3 x 6
#>       n  mean    sd target samples     below
#>   <dbl> <dbl> <dbl>  <dbl> <list>      <int>
#> 1   200    10     5      7 <dbl [200]>    47
#> 2   300    40    15     23 <dbl [300]>    41
#> 3   150    50    65     30 <dbl [150]>    58

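【Comments】:

Very cool. Is there a way to drop the samples column in the same operation? I have >4.3m rows, and the distributions' sample sizes are often >5m.
Sure, but it's simpler not to assign it to a column in the first place: df %>% mutate(below = map2_int(pmap(.[1:3], rnorm), target, ~sum(.x < .y))) or, in a single iteration, df %>% mutate(below = pmap_int(., ~sum(rnorm(..1, ..2, ..3) < ..4)))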

【Solution 2】:

You can do this in base R with sapply and an anonymous function:

df$below = with(df,  
    sapply(1:3, function(i) sum(rnorm(ssize[i], mean[i], sd[i]) < target[i])))
df$below
[1] 44 45 48

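【Comments】:

I like mapply because it avoids iterating over indices, but it has to be used with do.call unless you want to type out all the variable names: do.call(mapply, c(function(...) sum(rnorm(..1, ..2, ..3) < ..4), df)). It seems like there must be a more elegant idiom, but I can't find one.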
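A variant of the mapply idea in the comment above, sketched under the assumption that the data frame uses the question's column names: if the helper's formal arguments are named after the columns, do.call matches them by name and the ..1, ..2 placeholders are not needed. count_below is a hypothetical helper introduced for illustration.

```r
set.seed(47)  # arbitrary seed, only for reproducibility

df <- data.frame(ssize  = c(200, 300, 150),
                 mean   = c(10, 40, 50),
                 sd     = c(5, 15, 65),
                 target = c(7, 23, 30))

# Hypothetical helper: its formals are named after the data frame's columns.
count_below <- function(ssize, mean, sd, target) {
  sum(rnorm(ssize, mean, sd) < target)
}

# A data frame is a named list of columns, so do.call hands each column
# to mapply as a named argument, and mapply passes one element of each
# column per call to count_below.
df$below <- do.call(mapply, c(list(FUN = count_below), df))
```

This only works while every column of df corresponds to an argument of the helper; an extra column would make the call fail with an unused-argument error.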