【Question title】: Dplyr summarise multiple columns based on condition
【Posted】: 2019-03-06 17:38:35
【Question description】:

I have a dataset like this:

df.in <- structure(list(
  id = c(1, 1, 2, 3),
  x1 = c(0, 1, NA, 0),
  x2 = c("Lorem ipsum dolor sit amet", "dolore eu fugiat nulla pariatur",
         "Sed ut perspiciatis unde omnis", "Nemo enim ipsam voluptatem"),
  x3 = c("Donec ullamcorper elit quis risus", "Donec ullamcorper elit quis risus",
         "Curabitur euismod", "Mauris felis orci")
), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

> df.in
# A tibble: 4 x 4
     id    x1 x2                              x3                               
  <dbl> <dbl> <chr>                           <chr>                            
1     1     0 Lorem ipsum dolor sit amet      Donec ullamcorper elit quis risus
2     1     1 dolore eu fugiat nulla pariatur Donec ullamcorper elit quis risus
3     2    NA Sed ut perspiciatis unde omnis  Curabitur euismod                
4     3     0 Nemo enim ipsam voluptatem      Mauris felis orci 


I am trying to use dplyr::group_by() to get this:

df.out <- structure(list(
  id = c(1, 2, 3),
  x1 = c(1, NA, 0),
  x2 = c("dolore eu fugiat nulla pariatur", "Sed ut perspiciatis unde omnis",
         "Nemo enim ipsam voluptatem"),
  x3 = c("Donec ullamcorper elit quis risus", "Curabitur euismod",
         "Mauris felis orci")
), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))

> df.out
# A tibble: 3 x 4
     id    x1 x2                              x3                               
  <dbl> <dbl> <chr>                           <chr>                            
1     1     1 dolore eu fugiat nulla pariatur Donec ullamcorper elit quis risus
2     2    NA Sed ut perspiciatis unde omnis  Curabitur euismod                
3     3     0 Nemo enim ipsam voluptatem      Mauris felis orci  


I can do:

df.in %>%
  group_by(id) %>%
  summarise(x1 = max(x1))


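But how do I:

  1. Summarise x2 and x3 so that they keep the values from the row where max(x1) occurs?
  2. Apply the same logic to several x columns? Is there a way to do this with something like summarise_all?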

【Question comments】:

    Tags: r group-by dplyr tidyverse


    【Solution 1】:

    We can build the condition with max inside summarise_at:

    library(dplyr)
    df.in %>% 
      group_by(id) %>% 
      # Columns are named explicitly; the ~ lambda replaces the deprecated funs()
      summarise_at(vars(x2, x3), ~ if (n() == 1) . else .[x1 == max(x1, na.rm = TRUE)])
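The summarise_at call above returns only id, x2 and x3, while the desired df.out also keeps x1. A minimal sketch of one way to keep it, assuming dplyr 1.0.0 or later (for across()). The helper first_max_row is a hypothetical name, not part of dplyr. x1 is summarised last on purpose, because summarise() lets later expressions see columns that earlier ones have already overwritten.

```r
library(dplyr)

# Same data as df.in in the question
df.in <- tibble(
  id = c(1, 1, 2, 3),
  x1 = c(0, 1, NA, 0),
  x2 = c("Lorem ipsum dolor sit amet", "dolore eu fugiat nulla pariatur",
         "Sed ut perspiciatis unde omnis", "Nemo enim ipsam voluptatem"),
  x3 = c("Donec ullamcorper elit quis risus", "Donec ullamcorper elit quis risus",
         "Curabitur euismod", "Mauris felis orci")
)

# Hypothetical helper: position of the first row holding the largest x,
# falling back to row 1 when every value in the group is NA.
first_max_row <- function(x) {
  i <- which.max(x)  # skips NA; returns integer(0) if all values are NA
  if (length(i) == 0) 1L else i
}

res <- df.in %>%
  group_by(id) %>%
  summarise(
    # every non-grouping column except x1 takes its value from that row
    across(-x1, ~ .x[first_max_row(x1)]),
    # x1 goes last so the lambdas above still see the original column
    x1 = x1[first_max_row(x1)]
  ) %>%
  select(id, x1, everything())

res
```

Because across(-x1, ...) selects by exclusion, any additional x columns would be picked up without listing them. Ties on x1 resolve to the first tied row.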
    

    Instead of summarise_at, we can also use filter or slice. With filter:

    df.in %>%
      group_by(id) %>% 
      filter((n() == 1) | (x1 == max(x1, na.rm = TRUE)))
    

    Or with slice:

    df.in %>% 
      group_by(id) %>% 
      slice(which(n() == 1 | (x1 == max(x1, na.rm = TRUE)))[1])
    

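    【Discussion】:

    • In the second option, what happens if there are ties in x1?
    • @ThomasSpeidel It will return all of the tied rows. What would you like to happen in those cases?
    • Thanks. I would like to keep the first non-empty string.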
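The tie-breaking rule requested in the last comment could be sketched as follows. This is only an illustration under stated assumptions: dplyr 1.0.0 or later, "non-empty" meaning neither NA nor "", and a hypothetical helper first_nonempty_at_max together with made-up data df.ties. Neither comes from the original thread.

```r
library(dplyr)

# Hypothetical data: id 1 has a tie on x1, and the first tied row has an empty x2
df.ties <- tibble(
  id = c(1, 1, 1, 2),
  x1 = c(1, 1, 0, NA),
  x2 = c("", "b", "c", "d"),
  x3 = c("p", "q", "r", "s")
)

# Hypothetical helper: among the rows tied for the largest key, return the
# first value that is neither NA nor "". If the key is NA throughout the
# group, every row is treated as a candidate.
first_nonempty_at_max <- function(value, key) {
  if (all(is.na(key))) {
    tied <- seq_along(key)
  } else {
    tied <- which(key == max(key, na.rm = TRUE))
  }
  candidates <- value[tied]
  keep <- candidates[!is.na(candidates) & candidates != ""]
  if (length(keep) > 0) keep[1] else candidates[1]
}

res.ties <- df.ties %>%
  group_by(id) %>%
  summarise(
    across(c(x2, x3), ~ first_nonempty_at_max(.x, x1)),
    # x1 is summarised last so the lambdas above see the original column
    x1 = if (all(is.na(x1))) NA_real_ else max(x1, na.rm = TRUE)
  )

res.ties
```

Note that each column is resolved independently, so x2 and x3 may end up coming from different tied rows. If whole rows must stay together, the slice approach above, which keeps the first tied row, is the simpler choice.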