R: Draw 100 random prices from each cut quality in the diamonds data set? (Answers)

【Question title】: R: Draw 100 random prices from each cut quality in diamonds data set?
【Posted】: 2019-07-15 20:06:33
【Question description】:

I am using the diamonds data set:

install.packages("ggplot2")
library(ggplot2)
data("diamonds")

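I have to create a data frame that randomly samples 100 prices from each cut quality (Fair, Good, Very Good, Premium, Ideal), which will give me 500 data points. I'm having some trouble getting there, and any help would be appreciated! Here is one formula I tried, but I can't seem to figure out how to include all of the subsets that belong to "cut".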

diamonds$price[ sample( diamonds$cut, size=100, replace=FALSE )]

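I have also tried using the aggregate function, but that seemed to take me even further from where I need to be. I'm sure I am just missing something fairly obvious, but I'm new to this and couldn't find anything about it online. Thank you!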

Thanks to Camille, I was able to do:

 Test.1<-diamonds %>%
      group_by(cut) %>%
      sample_n(size = 100) %>%
      count(price)

I now can't seem to work with this data, as I need to find the mean, standard deviation, etc. for each cut quality.

【Question discussion】:

Tags: r dataframe random sample

【Solution 1】:

You can do this using split-apply-combine logic.

library(ggplot2)
data(diamonds)

head(diamonds)

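# Split the data frame into a list with one element per cut level
xy <- split(diamonds, f = diamonds$cut)

# Draw 100 random rows (without replacement) from each element
xy <- lapply(xy, FUN = function(x) {
  x[sample(seq_len(nrow(x)), 100), ]
})

# Bind the five samples back into a single data frame
xy <- do.call(rbind, xy)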
table(xy$cut)

 Fair      Good Very Good   Premium     Ideal 
  100       100       100       100       100
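
Since the question asks only for prices, the same split-apply-combine idea can be applied to the price vector alone. The following is a minimal base R sketch, not part of the original answer; the names `price_by_cut` and `prices` and the seed value are assumptions chosen for illustration.

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

set.seed(123)  # assumption: fixed seed so the draw can be repeated

# Split the price vector by cut, then draw 100 prices (without replacement) per cut
price_by_cut <- lapply(split(diamonds$price, diamonds$cut), sample, size = 100)

# Combine the five samples into one long data frame with 500 rows
prices <- stack(price_by_cut)
names(prices) <- c("price", "cut")

# Per-cut mean and standard deviation of the sampled prices
sapply(price_by_cut, mean)
sapply(price_by_cut, sd)
```

Note that `stack()` returns `cut` as a plain (unordered) factor, so the ordering information of the original column is not carried over.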

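【Discussion】:

【Solution 2】:

It doesn't need to be any more complicated than dplyr. dplyr::sample_n can operate on a grouped data frame, so that N samples are drawn from each of the groups.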

library(dplyr)
library(ggplot2)

diamonds %>%
  group_by(cut) %>%
  sample_n(size = 100)
#> # A tibble: 500 x 10
#> # Groups:   cut [5]
#>    carat cut   color clarity depth table price     x     y     z
#>    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  0.7  Fair  D     SI2      65.6    55  2167  5.59  5.5   3.64
#>  2  1.01 Fair  E     SI1      64.8    58  4480  6.34  6.29  4.09
#>  3  0.7  Fair  G     VS1      65.2    57  2290  5.56  5.52  3.61
#>  4  0.7  Fair  F     I1       65.4    59   992  5.6   5.49  3.63
#>  5  1    Fair  G     SI1      63.1    59  4163  6.32  6.27  3.97
#>  6  2.01 Fair  E     SI2      62.1    66 14948  7.99  7.92  4.94
#>  7  0.7  Fair  G     VS1      56.2    65  2384  5.93  5.88  3.32
#>  8  0.7  Fair  I     VS1      60.2    66  2234  5.77  5.62  3.44
#>  9  0.7  Fair  G     VS2      66.5    57  2575  5.4   5.46  3.61
#> 10  1.13 Fair  F     VS1      64.5    55  7335  6.62  6.56  4.25
#> # … with 490 more rows

To verify:

diamonds %>%
  group_by(cut) %>%
  sample_n(size = 100) %>%
  count(cut)
#> # A tibble: 5 x 2
#> # Groups:   cut [5]
#>   cut           n
#>   <ord>     <int>
#> 1 Fair        100
#> 2 Good        100
#> 3 Very Good   100
#> 4 Premium     100
#> 5 Ideal       100

Created on 2019-02-21 by the reprex package (v0.2.1)
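
In more recent dplyr releases, `sample_n()` is marked as superseded in favour of `slice_sample()`. A hedged sketch of the equivalent call is shown here, assuming dplyr 1.0.0 or later is installed; the object name `sampled` and the seed are illustrative only.

```r
library(dplyr)    # assumes dplyr >= 1.0.0, which provides slice_sample()
library(ggplot2)  # provides the diamonds data set

set.seed(1)  # assumption: fixed seed for a repeatable draw

# Draw 100 random rows from each cut, then drop the grouping
sampled <- diamonds %>%
  group_by(cut) %>%
  slice_sample(n = 100) %>%
  ungroup()

# Check that every cut contributes exactly 100 rows
table(sampled$cut)
```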

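【Discussion】:

Very helpful, thank you! I ended up using your approach and created a data frame (Test.1) containing the prices (see the edited code at the top). Do you know how I would find the mean of the prices listed in the Test.1 data frame? I can't convert it cleanly to numeric.
I'm not sure what you mean by not being able to convert it, but you can use dplyr::summarize to do whatever summary calculation you need for each group. That gets into a different question, though, and there should already be plenty of answers about it on SO.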
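
The dplyr::summarize suggestion could look something like the sketch below. One likely source of the difficulty is the trailing `count(price)` in Test.1, which replaces the sampled rows with a tally of distinct prices per cut; keeping the sampled rows and summarising them directly avoids that. The names `price_sample`, `price_summary`, `mean_price`, and `sd_price`, as well as the seed, are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

set.seed(42)  # assumption: fixed seed so the sample is reproducible

# Keep the sampled rows themselves instead of piping into count(price)
price_sample <- diamonds %>%
  group_by(cut) %>%
  sample_n(size = 100) %>%
  ungroup() %>%
  select(cut, price)

# One row per cut with the mean and standard deviation of the sampled prices
price_summary <- price_sample %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price), sd_price = sd(price))

price_summary
```

Here `price_sample` holds the 500 data points (100 per cut) and `price_summary` holds one row per cut quality; other statistics such as `median(price)` can be added to the same `summarise()` call.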