【Question title】: Sample n rows from a data frame by group using another data frame
【Posted】: 2020-06-11 19:44:36
【Question】:

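I want to randomly sample n rows from one data frame, based on criteria held in another data frame.

Example

Randomly sample rows from the ggplot2::mpg data frame, grouped by manufacturer and year, where n is the pick column of the pick_df data frame.

That is, randomly sample from ggplot2::mpg 3 rows for Hondas made in 2008, 10 rows for Volkswagens made in 1999, 6 rows for Audis made in 1999, and so on.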

  manufacturer  year  pick
  <chr>        <int> <int>
1 honda         2008     3
2 volkswagen    1999    10
3 audi          1999     6
4 land rover    2008     2
5 subaru        1999     6

Expected output:

  manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
   <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
 1 honda        civic        1.8  2008     4 manual(m5) f        26    34 r     subcompact
 2 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
 3 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
 4 volkswagen   gti          2.8  1999     6 manual(m5) f        17    24 r     compact   
 5 volkswagen   passat       2.8  1999     6 manual(m5) f        18    26 p     midsize   
 6 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
 7 volkswagen   new beetle   2    1999     4 auto(l4)   f        19    26 r     subcompact
 8 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
 9 volkswagen   passat       2.8  1999     6 auto(l5)   f        16    26 p     midsize   
10 volkswagen   jetta        2.8  1999     6 auto(l4)   f        16    23 r     compact   
11 volkswagen   new beetle   2    1999     4 manual(m5) f        21    29 r     subcompact
12 volkswagen   passat       1.8  1999     4 manual(m5) f        21    29 p     midsize   
13 volkswagen   gti          2    1999     4 auto(l4)   f        19    26 r     compact  

...27 rows total...

Head of the mpg data frame to sample from:

   manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
   <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
 1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
 2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
 3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
 4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
 5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
 6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
 7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
 8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
 9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact

Data sources for the reprex:

Source for the pick data frame, pick_df:

structure(list(manufacturer = c("honda", "volkswagen", "audi", 
"land rover", "subaru"), year = c(2008L, 1999L, 1999L, 2008L, 
1999L), pick = c(3L, 10L, 6L, 2L, 6L)), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -5L))

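Data frame to sample from, mpg: ggplot2::mpg

Tried so far

I can use filter, or possibly slice, but the coding is all manual. The actual use case has thousands of rows and hundreds of groups.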

filter(mpg, manufacturer=='honda', year==2008) %>% sample_n(3)
filter(mpg, manufacturer=='volkswagen', year==1999) %>% sample_n(10)
etc...

Edit: looping over filter works, but it is a bit ugly:

df <- mpg[0,]
for(i in 1:nrow(pick_df)){
  temp <- filter(mpg, manufacturer==pick_df$manufacturer[i], year==pick_df$year[i]) %>% sample_n(pick_df$pick[i])
  df <- rbind(temp,df)
}

【Comments】:

    Tags: r grouping sampling


    【Solution 1】:

    We can do an inner_join with 'pick_df', group by 'manufacturer' and 'year', and call sample_n with the first value of 'pick' in each group.

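    library(dplyr)
    library(ggplot2)  # provides the mpg data set
    mpg %>%
        # name the join keys explicitly so the join does not have to guess them
        inner_join(pick_df, by = c("manufacturer", "year")) %>% 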
        group_by(manufacturer, year) %>%
        sample_n(first(pick))
    # A tibble: 27 x 12
    # Groups:   manufacturer, year [5]
    #   manufacturer model       displ  year   cyl trans      drv     cty   hwy fl    class       pick
    #   <chr>        <chr>       <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>      <int>
    # 1 audi         a4 quattro    1.8  1999     4 auto(l5)   4        16    25 p     compact        6
    # 2 audi         a6 quattro    2.8  1999     6 auto(l5)   4        15    24 p     midsize        6
    # 3 audi         a4            2.8  1999     6 auto(l5)   f        16    26 p     compact        6
    # 4 audi         a4 quattro    2.8  1999     6 auto(l5)   4        15    25 p     compact        6
    # 5 audi         a4            1.8  1999     4 auto(l5)   f        18    29 p     compact        6
    # 6 audi         a4            2.8  1999     6 manual(m5) f        18    26 p     compact        6
    # 7 honda        civic         1.8  2008     4 manual(m5) f        26    34 r     subcompact     3
    # 8 honda        civic         2    2008     4 manual(m6) f        21    29 p     subcompact     3
    # 9 honda        civic         1.8  2008     4 auto(l5)   f        24    36 c     subcompact     3
    #10 land rover   range rover   4.2  2008     8 auto(s6)   4        12    18 r     suv            2
    # … with 17 more rows
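
    The join-and-sample approach above leaves the helper column 'pick' in the result and stops with an error if a group has fewer rows than its 'pick' value. The following is only a sketch of one possible variant, not part of the original answer: it samples row positions with slice(), caps the request at the group size, and drops 'pick' afterwards. The seed value and the name `sampled` are arbitrary choices made for illustration.

```r
library(dplyr)

# pick_df rebuilt from the values given in the question
pick_df <- data.frame(
  manufacturer = c("honda", "volkswagen", "audi", "land rover", "subaru"),
  year = c(2008L, 1999L, 1999L, 2008L, 1999L),
  pick = c(3L, 10L, 6L, 2L, 6L),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed (an assumption), only to make the draw reproducible

sampled <- ggplot2::mpg %>%
  # keep only the groups listed in pick_df and attach their pick value
  inner_join(pick_df, by = c("manufacturer", "year")) %>%
  group_by(manufacturer, year) %>%
  # draw row positions within each group; min() caps the request at the
  # group size so an oversized pick returns the whole group instead of failing
  slice(sample(n(), min(first(pick), n()))) %>%
  ungroup() %>%
  # drop the helper column so the result has the same columns as mpg
  select(-pick)

nrow(sampled)
```

    With the pick_df from the question every group is large enough, so the result should have 3 + 10 + 6 + 2 + 6 = 27 rows. Whether silently capping an oversized 'pick' is acceptable depends on the use case; removing the min() call restores the error.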
    

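    【Comments】:

    • Thanks for the reply, this is much better than the loop I had started writing.
    • I think there is a delay before I can accept... thanks again.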