Sampling from a normal distribution using a for loop: answers

【Question Title】: Sampling from a normal distribution using a for loop
【Posted】: 2020-06-03 03:36:18
【Question Description】:

I'm trying to sample from a uniform distribution 1000 times, each time computing the mean of 20 random draws from that uniform distribution.

Now let's loop through 1000 times, sampling 20 values from a uniform distribution and computing the mean of the sample, saving this mean to a variable called sampMean within a tibble called uniformSampleMeans.
{r 2c}

unif_sample_size = 20 # sample size
n_samples = 1000 # number of samples

# set up q data frame to contain the results
uniformSampleMeans <- tibble(sampMean = runif(n_samples, unif_sample_size))


# loop through all samples.  for each one, take a new random sample, 
# compute the mean, and store it in the data frame

for (i in 1:n_samples){
  uniformSampleMeans$sampMean[i] = summarize(uniformSampleMeans = mean(unif_sample_size))
}

I managed to generate a tibble, but its values are all "NaN". In addition, when I reach the for loop I get an error.

Error in summarise_(.data, .dots = compat_as_lazy_dots(...)) : argument ".data" is missing, with no default

Any insight would be much appreciated!

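【Question Comments】:

In the body of your post you say "normal distribution" (twice), but you use runif to sample from a uniform distribution. Is that a typo?
For random sampling from a uniform distribution (runif()), you need three arguments: the number of samples n, the minimum min, and the maximum max. You can't generate a random sample from runif() with only two arguments.
@AdamB., runif(1,-1) works fine.
Didn't know that, cheers. I just realized that runif(1) works too. But these seem to be very specific defaults: for example, runif(1000, 20) doesn't work, and even if it did (using the same logic as runif(1, -1)) it wouldn't do what @Salma Abdel-Raheem seems to want it to do.
Yes, it was a typo, sorry; the data come from a uniform distribution. I have updated my post to reflect the problem as my professor posted it in the assignment.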
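
The exchange above about runif() arguments can be checked directly. A minimal sketch, assuming only base R: runif(n, min = 0, max = 1) fills its arguments positionally, so runif(n_samples, unif_sample_size) sets min = 20 while max stays at its default of 1. With min greater than max, runif() returns NaN (with a warning), which would account for the NaN values in the question's tibble. The names x1, x2, bad and good are made up for illustration.

```r
# runif(n, min = 0, max = 1): the second positional argument is `min`,
# not a per-sample size
x1 <- runif(1)        # one draw from [0, 1]
x2 <- runif(1, -1)    # one draw from [-1, 1]: min = -1, max keeps its default of 1

# Same call shape as in the question: min = 20 exceeds max = 1,
# so every returned value is NaN (the warning is suppressed here)
bad <- suppressWarnings(runif(1000, 20))
all(is.nan(bad))

# What was probably intended: 20 draws per sample, with min and max spelled out
good <- runif(20, min = 0, max = 1)
length(good)
```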

Tags: r loops dplyr normal-distribution

【Solution 1】:

You don't need dplyr.

rep<-1000
size<-20

# initialize the dataframe
res<-data.frame(rep=NA,mean=NA)

for ( i in 1:rep) {
        samp<-rnorm(size) # here you actually create your sample of 20 numbers from the normal distribution
        res[i,]$rep<-i #save in the first column the number of the replicate sampling (optional)
        res[i,]$mean<-mean(samp) # here you calculate the mean of the random sample and store it into the data frame
}
res

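【Comments】:

【Solution 2】:

If what you want is to generate 1000 replicates of a sample of 20 observations from a uniform distribution (minimum 0, maximum 1) and then take the mean of each sample, here is a concise way to do it with the tidyverse: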

library(tidyverse)

uniform_samples <- map(1:1000, ~ runif(20, 0, 1))
uniform_sample_means <- map_dbl(uniform_samples, ~ mean(.x))

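【Comments】:

Why is the tidyverse needed? You could do this in base R (lapply), though for this you could also just use purrr.
1) I prefer to use formulas wherever possible; they make things clear and concise, and afaik lapply() doesn't support them. 2) I just tried solving the problem with lapply(), and it wouldn't accept the 1:1000 vector without trying to use it as n. So I had to use lapply(rep(20, 1000), runif, min = 0, max = 1), which seems less elegant to me. 3) To get a vector of means (rather than a list), I would have to use sapply(), and sapply is known for being inconsistent.
lapply(1:3, runif, n = 20, min = -1). lapply and sapply pass each element as the first argument by default, but if you name the other arguments you can make them fill the first unspecified argument instead. I'm not against the formula-style functions; I find them more concise. Half of my point is about loading all of the dozens of tidyverse packages for two functions. Bear in mind that not everyone can install the tidyverse (policy or other reasons), or they may want to try this and have to sit through a long compile just to test two functions. Consider library(purrr) alone.
I get it, this is a contentious topic for many. I don't have all of the tidyverse installed (even though I use dplyr and tidyr daily), probably never will, and I've seen plenty of beginners who don't either. The first-time compile on any OS is daunting, and our instant-gratification society doesn't cope well with that :-) In the end it's your choice; thanks for the discussion.
I use purrr regularly, but for me very little of it is more functional than aesthetic (e.g., ~ functions); and while I fully agree about the capriciousness of sapply, I am fairly defensive about using the base *apply functions. stringr is nice, but I haven't found much that can't be done almost as directly with gsub or gregexpr/regmatches. Maybe I'm missing something? ¯\_(ツ)_/¯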
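
For readers following the base R side of this thread, here is a minimal sketch of the same computation without purrr. It assumes that wrapping the draw in an anonymous function is acceptable, and it uses vapply() rather than sapply() so that the return type is fixed. The name base_sample_means is made up for illustration.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

# 1000 replicates; each draws 20 uniform(0, 1) values and returns their mean.
# vapply() checks that every result is a single numeric value.
base_sample_means <- vapply(
  seq_len(1000),
  function(i) mean(runif(20, min = 0, max = 1)),
  numeric(1)
)

length(base_sample_means)  # 1000
```

The index i is ignored inside the function; it only drives the number of repetitions, which sidesteps the issue of 1:1000 being passed to runif() as n.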

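【Solution 3】:

Building a data.frame row by row performs very poorly: each time you add a row, R makes a full copy of all existing rows, so by row 900, adding one more row means the original 900 rows exist twice. This scales badly.

Also, note that drawing many small random samples is much more expensive than drawing one larger sample.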

set.seed(42)
m <- matrix(rnorm(1000*20), ncol = 20)
head(m)
#        [,1]   [,2]   [,3]   [,4]   [,5]    [,6]   [,7]    [,8]    [,9]   [,10]  [,11]   [,12]
# [1,]  1.371  2.325  0.251 -0.686 -0.142  0.0712  0.173  1.4163 -0.0575 -0.9221  1.163 -0.2945
# [2,] -0.565  0.524 -0.278 -0.793 -0.814  0.9703 -1.273  0.5572 -0.2490 -0.4958 -0.190  0.4641
# [3,]  0.363  0.971 -1.725 -0.407 -0.326  0.3100 -0.868  0.9812 -1.5242 -3.1105 -0.289 -1.5371
# [4,]  0.633  0.377 -2.007 -1.149  0.378 -0.1395  0.626 -0.5862  0.4636 -0.6928 -0.399  0.9862
# [5,]  0.404 -0.996 -1.292  1.116 -1.994 -0.3263 -0.106  0.9392 -1.1876  0.2989  0.709  0.6302
# [6,] -0.106 -0.597  0.366 -0.879 -0.999 -0.1188 -0.256 -0.0647  0.4941 -0.0687 -1.623  0.0573
#        [,13]    [,14]    [,15]  [,16]  [,17]  [,18]   [,19]  [,20]
# [1,]  0.0538 -1.80043 -2.29607 -1.020  0.496  0.110  1.0251  1.790
# [2,]  0.7534 -0.10643  0.00465 -0.754  0.519 -0.741 -1.4492 -0.262
# [3,]  0.2499  1.83347 -1.61634 -1.226 -0.422 -0.511  1.4175 -1.297
# [4,] -0.4441  1.02390  1.73313 -1.017  0.863 -0.912 -1.0353  0.618
# [5,] -0.0503 -0.00429 -0.67368  1.722 -0.778 -1.293  0.0853 -0.292
# [6,] -0.4678  2.27991 -0.09442  3.000  0.148  0.905  0.2451 -0.301
m2 <- apply(m, 1, mean)
length(m2)
# [1] 1000
head(m2)
# [1]  0.1513 -0.2089 -0.4366 -0.0339 -0.1544  0.0959
mean(m[1,])
# [1] 0.151
library(tibble)  # provides tibble(); not loaded earlier in this answer
tibble(i = seq_along(m2), mu = m2)
# # A tibble: 1,000 x 2
#        i      mu
#    <int>   <dbl>
#  1     1  0.151 
#  2     2 -0.209 
#  3     3 -0.437 
#  4     4 -0.0339
#  5     5 -0.154 
#  6     6  0.0959
#  7     7  0.105 
#  8     8 -0.503 
#  9     9  0.0384
# 10    10 -0.175 
# # ... with 990 more rows
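
As a side note on the row-wise mean above, base R also provides rowMeans(), which computes the same quantity as apply(m, 1, mean). A minimal sketch follows, with the uniform draws the question asks about swapped in for rnorm(). That swap and the names m_unif, means_apply and means_row are illustrative and not part of the answer.

```r
set.seed(42)

# One large draw reshaped into 1000 rows (samples) of 20 values each
m_unif <- matrix(runif(1000 * 20), ncol = 20)

means_apply <- apply(m_unif, 1, mean)  # row-wise mean via apply()
means_row   <- rowMeans(m_unif)        # dedicated row-mean function

all.equal(means_apply, means_row)      # TRUE, up to floating-point tolerance
```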

【Comments】:

【Solution 4】:

Given that you've tagged this as a dplyr question, you can use summarise_all:

library(dplyr)

n_obs = 20 
n_samples = 1000 

samples <- data.frame(matrix(runif(n_obs * n_samples), nrow = 20))

summarise_all(samples, mean)

As others have noted, this can also be done in base R.
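
One possible base R equivalent, sketched under the assumption that each column of the matrix holds one sample of 20 draws, as in the code above. colMeans() is a base function; the names samples_mat and sample_means_base are made up here.

```r
n_obs <- 20
n_samples <- 1000

# 20 rows x 1000 columns: each column is one sample of n_obs uniform draws
samples_mat <- matrix(runif(n_obs * n_samples), nrow = n_obs)

# One mean per column, i.e. one mean per sample
sample_means_base <- colMeans(samples_mat)
length(sample_means_base)  # 1000
```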

Update per the OP's comment
Yes, it is possible to use a for loop, but it isn't recommended. Here is one way:

unif_sample_size = 20 
n_samples = 1000 
total_draws <- unif_sample_size * n_samples

uniformSampleMeans <- 
  tibble(draw_from_uniform = runif(n_samples * unif_sample_size))

sample_means <- numeric(n_samples)  # preallocate a numeric vector for the means

i <- 1
for (ix in seq(1, total_draws, by = unif_sample_size)) {
  start <- ix
  end <- ix + unif_sample_size - 1
  sample_means[i] <- mean(uniformSampleMeans$draw_from_uniform[start:end])
  i <- i + 1
}

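【Comments】:

Is there a way to do this using a for loop?
Salma, there are plenty of reasons not to use a for loop for something like this. But, as is often the case, "yes". One of the hardest things to learn when picking up a new programming language is not to project the efficiencies and assumptions of other languages onto the current one. In this case, for loops, while not evil, are not always the idiomatic or even the most efficient way to do things in R. Please keep an open mind about different ways of looking at and processing data.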
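
For completeness, a minimal sketch of a loop shaped like the one in the question, assuming the tibble package is installed. It keeps the assignment's names (uniformSampleMeans, sampMean) and differs from the question's code in two ways: the column is preallocated with NA_real_ rather than filled by runif(), and each iteration draws a fresh sample with runif() and takes mean() of it directly instead of calling summarize(). The min = 0 and max = 1 bounds are an assumption, since the assignment text does not state them.

```r
library(tibble)

unif_sample_size <- 20   # values per sample
n_samples <- 1000        # number of samples

# Preallocate the result column so the loop only overwrites existing slots
uniformSampleMeans <- tibble(sampMean = rep(NA_real_, n_samples))

for (i in seq_len(n_samples)) {
  draw <- runif(unif_sample_size, min = 0, max = 1)  # new sample of 20 values
  uniformSampleMeans$sampMean[i] <- mean(draw)       # store its mean
}

nrow(uniformSampleMeans)  # 1000
```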