【Question title】: How do you filter out data in the first group based on data in the second group in dplyr and/or tidyverse
【Posted】: 2021-06-19 15:56:43
【Question description】:

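I have a data frame (df) with the following columns: horse name, age, and speed data (value). Initially, I plotted the data with ggplot geom_boxplot to see the mean speed values by age.

Now I want to make the same plot, but this time including only the horses that raced more than 3 times as two-year-olds, and I am struggling to work out how to do that.

I tried grouping by (horse, age), then summarising the number of races for each horse at each age, and finally filtering out at age 2 n

Can anyone think of an elegant way to do this? It seems simple, but I am struggling.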

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(brew)
#> Warning: package 'brew' was built under R version 4.0.3

df <- tibble(horse=c("a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c","c","d","d","d","d","d","d"),
             age = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
             value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18))


df
#> # A tibble: 34 x 3
#>    horse   age value
#>    <chr> <dbl> <dbl>
#>  1 a         2    20
#>  2 a         2    21
#>  3 a         2    19
#>  4 a         2    23
#>  5 a         2    20
#>  6 a         3    17
#>  7 a         3    16
#>  8 a         3    23
#>  9 a         4    24
#> 10 a         4    14
#> # ... with 24 more rows



df %>%  
  ggplot(aes(x=as.factor(age), y=value, fill=as.factor(age))) +
  geom_boxplot(alpha=0.7) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, color = "red", fill = "red") +  # `fun` replaces the deprecated `fun.y`
  stat_summary(fun = mean, geom = "text", col = "black",     # Add text to plot
               vjust = -1.5, aes(label = paste("X:", round(after_stat(y), digits = 1)))) +
  theme(legend.position="none") +
  scale_fill_brewer(palette="Set1")

Created on 2021-06-19 by the reprex package (v0.3.0)

【Comments】:

    Tags: r dplyr tidyverse


    【Solution 1】:

    Here are a couple of ways to keep the horses that raced more than 3 times at age 2.

    1. Using filter -
    library(dplyr)
    
    df %>%
      group_by(horse) %>%
      filter(sum(age == 2) > 3) %>%
      ungroup()
    
    #   horse   age value
    #   <chr> <dbl> <dbl>
    # 1 a         2    20
    # 2 a         2    21
    # 3 a         2    19
    # 4 a         2    23
    # 5 a         2    20
    # 6 a         3    17
    # 7 a         3    16
    # 8 a         3    23
    # 9 a         4    24
    #10 a         4    14
    # … with 12 more rows
    
    2. Using a join -
    df %>%
      filter(age == 2) %>%
      count(horse) %>%
      filter(n > 3) %>%
      select(-n) %>%
      left_join(df, by = 'horse')
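
As an alternative to the join above, a filtering join can express the same idea without the `select(-n)` step. This is only a sketch, assuming the sample `df` from the question (rebuilt here so the block runs on its own); `keep` and `df_kept` are illustrative names, not part of the original answer.

```r
library(dplyr)

# Sample data from the question, rebuilt so this block is self-contained
df <- tibble(
  horse = rep(c("a", "b", "c", "d"), times = c(10, 6, 12, 6)),
  age   = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
  value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18)
)

# One row per horse that raced more than 3 times at age 2
keep <- df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3)

# semi_join() keeps the rows of df that have a match in `keep`
# and never adds columns from `keep`, so `n` does not need to be dropped
df_kept <- df %>%
  semi_join(keep, by = "horse")

df_kept
```

On this sample data only horses a and c have more than 3 races at age 2, so `df_kept` holds their 22 rows across all ages, which matches the row count shown for the `filter` approach.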
    

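    【Comments】:

      【Solution 2】:

      If I understand your goal correctly, the following should work.

      Here I assume you want to keep all observations for the horses that raced at least 3 times at age 2, that is, to keep their earlier and later races as well, not just the observations from when they were 2 years old.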

      library(dplyr)
      
      df <- tibble(horse=c("a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c","c","d","d","d","d","d","d"),
                   age = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
                   value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18))
      
      df %>% group_by(horse, age) %>% 
        mutate(n_races_by_age = n(),         
               check_if_keep = if_else(age == 2 & n_races_by_age >= 3, 1, 0)) %>% 
        ungroup(age) %>% 
        mutate(
          horse_to_keep = max(check_if_keep)
          # it is still grouped by horse, so keep all observations of those horses for 
          # which the above conditions are met. 
        )
      #> # A tibble: 34 x 6
      #> # Groups:   horse [4]
      #>    horse   age value n_races_by_age check_if_keep horse_to_keep
      #>    <chr> <dbl> <dbl>          <int>         <dbl>         <dbl>
      #>  1 a         2    20              5             1             1
      #>  2 a         2    21              5             1             1
      #>  3 a         2    19              5             1             1
      #>  4 a         2    23              5             1             1
      #>  5 a         2    20              5             1             1
      #>  6 a         3    17              3             0             1
      #>  7 a         3    16              3             0             1
      #>  8 a         3    23              3             0             1
      #>  9 a         4    24              2             0             1
      #> 10 a         4    14              2             0             1
      #> # … with 24 more rows
      

      If that is what you meant, just add %>% filter(horse_to_keep == 1) to get the desired result.
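
Putting the steps of this answer together with the plot from the question, the whole pipeline might look like the sketch below. It assumes dplyr 1.0.0 or later (for `ungroup(age)`) and ggplot2 3.3.0 or later (for the `fun` argument); `df_kept` and `p` are illustrative names, and the plot is a simplified version of the one in the question.

```r
library(dplyr)
library(ggplot2)

# Sample data from the question, rebuilt so this block is self-contained
df <- tibble(
  horse = rep(c("a", "b", "c", "d"), times = c(10, 6, 12, 6)),
  age   = c(2,2,2,2,2,3,3,3,4,4,2,2,3,3,3,4,2,2,2,2,2,3,3,3,3,3,4,4,2,3,3,3,3,4),
  value = c(20,21,19,23,20,17,16,23,24,14,23,24,18,19,16,19,17,24,19,18,17,15,18,12,12,14,15,11,23,24,14,23,24,18)
)

df_kept <- df %>%
  group_by(horse, age) %>%
  mutate(n_races_by_age = n(),
         check_if_keep = if_else(age == 2 & n_races_by_age >= 3, 1, 0)) %>%
  ungroup(age) %>%                               # still grouped by horse
  mutate(horse_to_keep = max(check_if_keep)) %>% # 1 for every row of a qualifying horse
  ungroup() %>%
  filter(horse_to_keep == 1) %>%
  select(horse, age, value)                      # drop the helper columns

# Same kind of boxplot as in the question, drawn on the filtered data only
p <- ggplot(df_kept, aes(x = factor(age), y = value, fill = factor(age))) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, color = "red") +
  theme(legend.position = "none")

p
```

With the sample data this keeps horses a and c. Note that the threshold here is "at least 3" (`>= 3`), as assumed in this answer; use `> 3` instead if the horse must have raced more than 3 times.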

      【Comments】:
