[Question title]: Conditional aggregation based on groups in a data frame in R
[Posted]: 2021-08-18 04:51:47
[Question description]:

library(dplyr)  # provides %>%, group_by() and mutate()

Data_Frame <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
                         Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                         Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
                         Col4 = c(4, 2, 2, 1, 4, 4, 4, 4, 4),
                         Col5 = c(7, 6, 3, 1, 3, 2, 5, 1, 2))

Data_Frame$Col2 <- as.Date(Data_Frame$Col2)
Data_Frame$Col3 <- as.Date(Data_Frame$Col3)
Data_Frame$Col1 <- as.factor(Data_Frame$Col1)

# Col6: years from Col2 to the latest Col3 date within each Col1 group
Data_Frame <- Data_Frame %>% group_by(Col1) %>% mutate(Col6 = lubridate::time_length(lubridate::interval(Col2, max(Col3)), "years"))

# Col7: bin Col6 into 1, 2, 5, 10, or 11 (anything above 10 years)
Data_Frame <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Col7 = ifelse(Col6 <= 1, 1, ifelse(Col6 >1 & Col6 <=2, 2, ifelse(Col6 >2 & Col6 <=5, 5, ifelse(Col6 >5 & Col6 <=10, 10, 11)))))

Data_Frame <- as.data.frame(Data_Frame)

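This is the data frame, where Col6 is the time difference in years obtained by subtracting each element of Col2 from the maximum date in Col3 within each group (A1 through A3) of Col1, and Col7 indicates which range each element of Col6 falls into (<= 1, > 1 and <= 2, > 2 and <= 5, > 5 and <= 10, or above 10, coded as 1, 2, 5, 10 and 11 respectively).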

I am having trouble generating additional columns based on different conditions.

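  1. Generation of Last1Col7 through Last10Col7:

The new columns Last1Col7 through Last10Col7 are created from Col7, grouped by A1 through A3 in Col1, such that

  • Last1Col7 is the number of elements (rows) in Col7 that are <= 1
  • Last2Col7 is the number of rows in Col7 that are <= 2
  • Last5Col7 is the number of rows in Col7 that are <= 5

The expected result is:
The following code: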

Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 1, ]),
                                                               
                                                               Last2Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 2, ]),
                                                               
                                                               Last5Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 5, ]),
                                                               
                                                               Last10Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 10, ]))

produces a completely different result:

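  2. Generation of Last1SumCol4Col7 through Last10SumCol4Col7:

    • Last1SumCol4Col7 is the sum of the entries in Col4 for the rows where Col7 is <= 1, within each group A1 through A3 in Col1

    • Last2SumCol4Col7 is the sum of the entries in Col4 for the rows where Col7 is <= 2, within each group A1 through A3 in Col1

    • Last5SumCol4Col7 is the sum of the entries in Col4 for the rows where Col7 is <= 5, within each group A1 through A3 in Col1

    • Last10SumCol4Col7 is the sum of the entries in Col4 for the rows where Col7 is <= 10, within each group A1 through A3 in Col1

The expected result is:

Using the following code: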

Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=1, ]$Col4),
                                                              
                                                              Last2SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=2, ]$Col4),
                                                              
                                                              Last5SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=5, ]$Col4),
                                                              
                                                              Last10SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=10, ]$Col4))

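the result is:

Starting from columns Last1Col7 through Last10Col7 and Last1SumCol4Col7 through Last10SumCol4Col7 initialised to zero and then applying the code above did not help either. What is fundamentally wrong with the code in the two sections above?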

[Question comments]:

    Tags: r dataframe dplyr group-by conditional-statements


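    [Solution 1]:

    We can use map to loop over the threshold values used in the comparison, then group by 'Col1' and create two columns in each iteration: the count of rows where 'Col7' is less than or equal to the current value (the sum of the logical vector), and the sum of the corresponding 'Col4' values where 'Col7' is less than or equal to that value.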

    library(purrr)
    library(dplyr)
    map_dfc(c(1, 2, 5, 10), ~ Data_Frame %>%
         group_by(Col1) %>% 
         transmute(!! sprintf("Last%dCol7", .x) := sum(Col7 <= .x),
                   !! sprintf("Last%dSumCol4Col7", .x) := sum(Col4[Col7<= .x])) %>% 
         ungroup %>%
         select(-Col1)) %>% 
     bind_cols(Data_Frame, .)
    

    -output

    #Col1       Col2       Col3 Col4 Col5      Col6 Col7 Last1Col7 Last1SumCol4Col7 Last2Col7 Last2SumCol4Col7 Last5Col7 Last5SumCol4Col7 Last10Col7
    #1   A1 2011-03-11 2018-10-22    4    7 9.7917808   10         0                0         0                0         1                2          3
    #2   A1 2014-08-21 2019-05-24    2    6 6.3452055   10         0                0         0                0         1                2          3
    #3   A1 2016-01-17 2020-12-25    2    3 4.9371585    5         0                0         0                0         1                2          3
    #4   A2 2017-06-30 2018-10-12    1    1 3.4712329    5         0                0         0                0         3                9          3
    #5   A2 2018-07-11 2019-09-24    4    3 2.4410959    5         0                0         0                0         3                9          3
    #6   A2 2018-11-28 2020-12-19    4    2 2.0575342    5         0                0         0                0         3                9          3
    #7   A3 2019-09-04 2018-10-22    4    5 1.2931507    2         2                8         3               12         3               12          3
    #8   A3 2020-02-29 2019-06-14    4    1 0.8060109    1         2                8         3               12         3               12          3
    #9   A3 2020-07-12 2020-12-20    4    2 0.4410959    1         2                8         3               12         3               12          3
    #  Last10SumCol4Col7
    #1                 8
    #2                 8
    #3                 8
    #4                 9
    #5                 9
    #6                 9
    #7                12
    #8                12
    #9                12
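For readers who find the `!! sprintf(...) :=` naming hard to follow, here is a minimal sketch of the same idea with a plain for loop and glue-style column names. The toy data frame and its column names (grp, bin, val) are made up for illustration and are not the OP's data; the sketch assumes dplyr 1.0 or later, where `"{k}"` in a name on the left of `:=` is interpolated.

```r
library(dplyr)

# Hypothetical toy data, not the OP's data frame
toy <- data.frame(grp = c("a", "a", "a", "b", "b"),
                  bin = c(1, 5, 10, 1, 1),
                  val = c(4, 2, 2, 3, 3))

out <- toy
for (k in c(1, 5)) {
  out <- out %>%
    group_by(grp) %>%
    mutate("n_le_{k}" := sum(bin <= k),        # rows in the group with bin <= k
           "s_le_{k}" := sum(val[bin <= k])) %>%  # sum of val over those rows
    ungroup()
}
```

Each pass adds one count column and one sum column per threshold, so the loop grows the frame in place instead of binding pieces together afterwards.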
    

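    The OP's code gives the wrong sum because Data_Frame[Data_Frame$Col7 <= 2, ] breaks the grouping: it subsets the whole data frame rather than the rows of the current group. Within the tidyverse we don't need Data_Frame$; if the data has to be referenced explicitly, use cur_data() (pick(everything()) in dplyr 1.1.0 and later, where cur_data() is deprecated). Here, though, Col7 <= 2 is all that is needed.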
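To make the difference concrete, here is a small sketch comparing the two styles side by side. The toy data frame and its column names (grp, bin, val) are hypothetical and only stand in for Col1, Col7 and Col4.

```r
library(dplyr)

# Hypothetical toy data, not the OP's data frame
toy <- data.frame(grp = c("a", "a", "a", "b", "b"),
                  bin = c(1, 5, 10, 1, 1),
                  val = c(4, 2, 2, 3, 3))

res <- toy %>%
  group_by(grp) %>%
  mutate(
    n_whole = nrow(toy[toy$bin <= 1, ]),     # refers to the full frame: grouping is ignored
    n_group = sum(bin <= 1),                 # bare column name: evaluated per group
    s_whole = sum(toy[toy$bin <= 1, ]$val),  # sum over the full frame
    s_group = sum(val[bin <= 1])             # sum within the current group
  ) %>%
  ungroup()
```

The `_whole` columns hold the same value on every row (3 and 10 here) because `toy$...` always looks at all five rows, while the `_group` columns differ between groups a and b.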

    [Comments]:

      [Solution 2]:

      Use cut() to obtain column Col7

      library(data.table)
      setDT(df1)[, `:=` (Col2 = as.Date(Col2), Col3 = as.Date(Col3) )]
      df1[, Col6 := lubridate::time_length(lubridate::interval(Col2, max(Col3)), "years"), by = Col1]
      df1[, Col7 := as.integer(as.character(cut(Col6, breaks = c(0, 1,2,5,10), labels = c(1,2,5,10)))), by = Col1]
      
      df1[, `:=` (Last1Col7 = 0, Last2Col7 = 0, Last5Col7 = 0, Last10Col7 = 0,
                  Last1SumCol4Col7 = 0, Last2SumCol4Col7 = 0, Last5SumCol4Col7 = 0, Last10SumCol4Col7 = 0) ]
      
      df1[Col7 <= 1, `:=` (Last1Col7 = .N, Last1SumCol4Col7 = sum(Col4)), by = Col1]
      df1[Col7 <= 2, `:=` (Last2Col7 = .N, Last2SumCol4Col7 = sum(Col4)), by = Col1]
      df1[Col7 <= 5, `:=` (Last5Col7 = .N, Last5SumCol4Col7 = sum(Col4)), by = Col1]
      df1[Col7 <= 10, `:=` (Last10Col7 = .N, Last10SumCol4Col7 = sum(Col4)), by = Col1]
      

      Output:

      df1
         Col1       Col2       Col3 Col4 Col5      Col6 Col7 Last1Col7 Last2Col7
      1:   A1 2011-03-11 2018-10-22    4    7 9.7917808   10         0         0
      2:   A1 2014-08-21 2019-05-24    2    6 6.3452055   10         0         0
      3:   A1 2016-01-17 2020-12-25    2    3 4.9371585    5         0         0
      4:   A2 2017-06-30 2018-10-12    1    1 3.4712329    5         0         0
      5:   A2 2018-07-11 2019-09-24    4    3 2.4410959    5         0         0
      6:   A2 2018-11-28 2020-12-19    4    2 2.0575342    5         0         0
      7:   A3 2019-09-04 2018-10-22    4    5 1.2931507    2         0         3
      8:   A3 2020-02-29 2019-06-14    4    1 0.8060109    1         2         3
      9:   A3 2020-07-12 2020-12-20    4    2 0.4410959    1         2         3
         Last5Col7 Last10Col7 Last1SumCol4Col7 Last2SumCol4Col7 Last5SumCol4Col7
      1:         0          3                0                0                0
      2:         0          3                0                0                0
      3:         1          3                0                0                2
      4:         3          3                0                0                9
      5:         3          3                0                0                9
      6:         3          3                0                0                9
      7:         3          3                0               12               12
      8:         3          3                8               12               12
      9:         3          3                8               12               12
         Last10SumCol4Col7
      1:                 8
      2:                 8
      3:                 8
      4:                 9
      5:                 9
      6:                 9
      7:                12
      8:                12
      9:                12
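Note that in the output above, row 7 shows 0 for Last1Col7 while rows 8 and 9 of the same group show 2: with a filter in the first argument, `:=` only updates the rows that pass the filter. If the group-wide value is wanted on every row, as in Solution 1's output, one possible variation is to drop the filter and put the condition inside the aggregate. This is a sketch on hypothetical toy data (grp, bin and val are made-up names), not the answer author's code.

```r
library(data.table)

# Hypothetical toy data, not the OP's data frame
dt <- data.table(grp = c("a", "a", "a", "b", "b"),
                 bin = c(1, 5, 10, 1, 1),
                 val = c(4, 2, 2, 3, 3))

# Filtered assignment: only rows with bin <= 1 are updated, the rest keep 0
dt[, n_filtered := 0L]
dt[bin <= 1, n_filtered := .N, by = grp]

# Unfiltered assignment: computed once per group and recycled to every row
dt[, `:=`(n_all = sum(bin <= 1),
          s_all = sum(val[bin <= 1])), by = grp]
```

Which of the two behaviours is correct depends on the expected result, which is not shown in the question as reproduced here.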
      

      Data:

      df1 <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
          
                               Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                        
                               Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
                    
                               Col4 = c(4, 2, 2, 1, 4, 4, 4, 4, 4),
                   
                               Col5 = c(7, 6, 3, 1, 3, 2, 5, 1, 2))
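One detail of the cut() call in this answer is worth checking against the question: with breaks c(0, 1, 2, 5, 10), any duration above 10 years falls outside every interval and becomes NA, whereas the question's nested ifelse() codes it as 11. The sample data has no such value, so the outputs agree. A sketch with made-up durations, assuming the catch-all 11 is wanted:

```r
x <- c(0.44, 1.29, 4.94, 9.79, 12.5)  # hypothetical durations in years

# Breaks as in the answer: right-closed intervals (0,1], (1,2], (2,5], (5,10]
bins_na <- as.integer(as.character(
  cut(x, breaks = c(0, 1, 2, 5, 10), labels = c(1, 2, 5, 10))))

# Open-ended breaks so that values above 10 map to the question's code 11
bins_11 <- as.integer(as.character(
  cut(x, breaks = c(-Inf, 1, 2, 5, 10, Inf), labels = c(1, 2, 5, 10, 11))))
```

The as.character() step matters because cut() returns a factor; converting a factor straight to integer would give the level positions (1, 2, 3, 4) rather than the labels.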
      

      [Comments]:

      • Thanks for this data.table approach