【Title】: Calculate cumulative sum in a group_by() on two different sets of columns in dplyr
【Posted】: 2020-10-02 07:41:39
【Question】:

My initial data frame looks like this:

library(tidyverse)

df_input <- data.frame(
            cohort = c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
                       "2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
                       "2019-03-01", "2019-04-01", "2019-04-01", "2019-04-01",
                       "2019-04-01", "2019-04-01", "2019-04-01", "2019-04-01"),
            months = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
               CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA, 22.2, 38.24,
                       46.08, 56.28, NA, NA, NA),
           CLV_for = c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95, 35.75,
                       26.1, 16.09, 10.37, 7.15, 6.08, 5.01)
      )

       cohort months   CLV CLV_for
1  2019-03-01      1 59.90    1.66
2  2019-03-01      2 61.10    1.42
3  2019-03-01      3 62.06    1.42
4  2019-03-01      4 62.58    1.42
5  2019-03-01      5 62.83    1.18
6  2019-03-01      6    NA    1.18
7  2019-03-01      7    NA    1.18
8  2019-03-01      8    NA    1.18
9  2019-03-01      9    NA    0.95
10 2019-04-01      1 22.20   35.75
11 2019-04-01      2 38.24   26.10
12 2019-04-01      3 46.08   16.09
13 2019-04-01      4 56.28   10.37
14 2019-04-01      5    NA    7.15
15 2019-04-01      6    NA    6.08
16 2019-04-01      7    NA    5.01

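I want to compute a cumulative sum (using cumsum() in a dplyr pipeline) that starts from the last non-NA value of the CLV column within each group (i.e. each cohort) and continues with the values in column CLV_for.

To explain the calculation better, I would like to split it into two steps.

1) Starting from the last non-NA value in the CLV column for cohort 2019-03-01, cumsum() the corresponding values in CLV_for. The same goes for cohort 2019-04-01.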

 df_inter <- data.frame(
  cohort = c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
             "2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
             "2019-03-01", "2019-04-01", "2019-04-01", "2019-04-01",
             "2019-04-01", "2019-04-01", "2019-04-01", "2019-04-01"),
  months = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
  CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA, 22.2, 38.24,
          46.08, 56.28, NA, NA, NA),
  cum_CLV_for = c(NA, NA, NA, NA, NA, 64.01, 65.19, 66.37, 67.32, NA,
                  NA, NA, NA, 63.43, 69.51, 74.52)
)

       cohort months   CLV cum_CLV_for
1  2019-03-01      1 59.90          NA
2  2019-03-01      2 61.10          NA
3  2019-03-01      3 62.06          NA
4  2019-03-01      4 62.58          NA
5  2019-03-01      5 62.83          NA
6  2019-03-01      6    NA       64.01 (<- 62.83 + 1.18)
7  2019-03-01      7    NA       65.19 (<- 64.01 + 1.18)
8  2019-03-01      8    NA       66.37 (<- 65.19 + 1.18)
9  2019-03-01      9    NA       67.32 (<- 66.37 + 0.95)
10 2019-04-01      1 22.20          NA
11 2019-04-01      2 38.24          NA
12 2019-04-01      3 46.08          NA
13 2019-04-01      4 56.28          NA
14 2019-04-01      5    NA       63.43 (<- 56.28 + 7.15)
15 2019-04-01      6    NA       69.51 (<- 63.43 + 6.08)
16 2019-04-01      7    NA       74.52 (<- 69.51 + 5.01)

2) As a second step, clean up by merging the two columns into a single column.

df_final <- data.frame(
                                      sub_date = c("2019-03-01", "2019-03-01", "2019-03-01",
                                                   "2019-03-01", "2019-03-01",
                                                   "2019-03-01", "2019-03-01",
                                                   "2019-03-01", "2019-03-01",
                                                   "2019-04-01", "2019-04-01",
                                                   "2019-04-01", "2019-04-01",
                                                   "2019-04-01", "2019-04-01",
                                                   "2019-04-01"),
                      months_after_acquisition = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
                                       cum_CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, 64.01, 65.19,
                                                   66.37, 67.32, 22.2, 38.24,
                                                   46.08, 56.28, 63.43, 69.51,
                                                   74.52)
                   )

     sub_date months_after_acquisition cum_CLV
1  2019-03-01                        1   59.90
2  2019-03-01                        2   61.10
3  2019-03-01                        3   62.06
4  2019-03-01                        4   62.58
5  2019-03-01                        5   62.83
6  2019-03-01                        6   64.01
7  2019-03-01                        7   65.19
8  2019-03-01                        8   66.37
9  2019-03-01                        9   67.32
10 2019-04-01                        1   22.20
11 2019-04-01                        2   38.24
12 2019-04-01                        3   46.08
13 2019-04-01                        4   56.28
14 2019-04-01                        5   63.43
15 2019-04-01                        6   69.51
16 2019-04-01                        7   74.52

Thanks for your help!

【Comments】:

    Tags: r dplyr tidyverse


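    【Solution 1】:

    Filling the CLV values downward within each cohort and combining the result with cumsum gives what you want:

    df_input %>% 
      group_by(cohort) %>% 
      arrange(months, .by_group = TRUE) %>% 
      mutate(cum_CLV = CLV) %>% 
      fill(cum_CLV) %>%   # tidyr::fill() carries the last non-NA CLV down within each cohort
      # add CLV_for only on rows where CLV is missing (is.na() acts as a 0/1 mask)
      mutate(cum_CLV = cum_CLV + cumsum(CLV_for*is.na(CLV)))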
    
    
    #  cohort     months   CLV CLV_for cum_CLV
    #    <fct>       <dbl> <dbl>   <dbl>   <dbl>
    #  1 2019-03-01      1  59.9    1.66    59.9
    #  2 2019-03-01      2  61.1    1.42    61.1
    #  3 2019-03-01      3  62.1    1.42    62.1
    #  4 2019-03-01      4  62.6    1.42    62.6
    #  5 2019-03-01      5  62.8    1.18    62.8
    #  6 2019-03-01      6  NA      1.18    64.0
    #  7 2019-03-01      7  NA      1.18    65.2
    #  8 2019-03-01      8  NA      1.18    66.4
    #  9 2019-03-01      9  NA      0.95    67.3
    # 10 2019-04-01      1  22.2   35.8     22.2
    # 11 2019-04-01      2  38.2   26.1     38.2
    # 12 2019-04-01      3  46.1   16.1     46.1
    # 13 2019-04-01      4  56.3   10.4     56.3
    # 14 2019-04-01      5  NA      7.15    63.4
    # 15 2019-04-01      6  NA      6.08    69.5
    # 16 2019-04-01      7  NA      5.01    74.5
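To see why multiplying by is.na(CLV) works, here is a minimal base R sketch for a single cohort. The values are copied from the 2019-04-01 cohort in the question; the variable names (clv, clv_for, increments, filled, cum_clv) are illustrative only, and the loop merely stands in for what fill() does in the pipeline.

```r
# Values for the 2019-04-01 cohort, copied from the question.
clv     <- c(22.2, 38.24, 46.08, 56.28, NA, NA, NA)
clv_for <- c(35.75, 26.1, 16.09, 10.37, 7.15, 6.08, 5.01)

# is.na(clv) is FALSE/TRUE, which multiplies as 0/1,
# so months with an observed CLV contribute nothing to the running sum.
increments <- clv_for * is.na(clv)

# Carry the last observed value forward (a stand-in for tidyr::fill()).
filled <- clv
for (i in seq_along(filled)[-1]) {
  if (is.na(filled[i])) filled[i] <- filled[i - 1]
}

# Observed months keep their CLV; missing months get last CLV + running CLV_for.
cum_clv <- filled + cumsum(increments)
round(cum_clv, 2)
# 22.20 38.24 46.08 56.28 63.43 69.51 74.52
```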
    

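    【Comments】:

      【Solution 2】:

      Here is one approach with case_when:

      library(dplyr)
      df_input %>% 
        group_by(cohort) %>%
        # running sum of CLV_for, counting only the rows where CLV is missing
        mutate(CumCLV = cumsum(case_when(is.na(CLV) ~ CLV_for,
                                  TRUE ~ 0)),
               # max(CLV) matches the last non-NA value here because CLV never decreases within a cohort
               CLV = case_when(is.na(CLV) ~ CumCLV + max(CLV, na.rm = TRUE), 
                               TRUE ~ CLV)) %>%
        dplyr::select(-CLV_for, -CumCLV)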
      
      # A tibble: 16 x 3
      # Groups:   cohort [2]
         cohort     months   CLV
         <fct>       <dbl> <dbl>
       1 2019-03-01      1  59.9
       2 2019-03-01      2  61.1
       3 2019-03-01      3  62.1
       4 2019-03-01      4  62.6
       5 2019-03-01      5  62.8
       6 2019-03-01      6  64.0
       7 2019-03-01      7  65.2
       8 2019-03-01      8  66.4
       9 2019-03-01      9  67.3
      10 2019-04-01      1  22.2
      11 2019-04-01      2  38.2
      12 2019-04-01      3  46.1
      13 2019-04-01      4  56.3
      14 2019-04-01      5  63.4
      15 2019-04-01      6  69.5
      16 2019-04-01      7  74.5
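If CLV could ever decrease within a cohort, max() would no longer pick the last observed value. A hedged variant that takes the last non-NA value explicitly is sketched below. The toy data frame and the helper column start are hypothetical and not part of the question.

```r
library(dplyr)

# Hypothetical toy data (not from the question): CLV dips before the NAs start.
toy <- data.frame(
  cohort  = rep("A", 5),
  months  = c(1, 2, 3, 4, 5),
  CLV     = c(10, 12, 11, NA, NA),
  CLV_for = c(1, 1, 1, 2, 3)
)

res <- toy %>%
  group_by(cohort) %>%
  arrange(months, .by_group = TRUE) %>%
  mutate(
    # last observed CLV in the cohort (11 here, whereas max() would give 12)
    start   = last(CLV[!is.na(CLV)]),
    cum_CLV = if_else(is.na(CLV),
                      start + cumsum(if_else(is.na(CLV), CLV_for, 0)),
                      CLV)
  ) %>%
  ungroup() %>%
  select(-start)

res$cum_CLV
# 10 12 11 13 16
```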
      

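      【Comments】:

        【Solution 3】:

        A data.table approach, for completeness:

        library(data.table)
        setDT(df_input)  # converts df_input to a data.table by reference
        # helper column: highest observed CLV per cohort (equal to the last non-NA value in this data)
        df_input[, max := max(CLV, na.rm = TRUE), by = cohort]
        # fill only the NA rows, then drop the helper and forecast columns
        df_input[ is.na(CLV), CLV := max + cumsum(CLV_for), by = cohort ][, c("max", "CLV_for") := NULL][]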
        
        #        cohort months   CLV
        # 1: 2019-03-01      1 59.90
        # 2: 2019-03-01      2 61.10
        # 3: 2019-03-01      3 62.06
        # 4: 2019-03-01      4 62.58
        # 5: 2019-03-01      5 62.83
        # 6: 2019-03-01      6 64.01
        # 7: 2019-03-01      7 65.19
        # 8: 2019-03-01      8 66.37
        # 9: 2019-03-01      9 67.32
        # 10: 2019-04-01      1 22.20
        # 11: 2019-04-01      2 38.24
        # 12: 2019-04-01      3 46.08
        # 13: 2019-04-01      4 56.28
        # 14: 2019-04-01      5 63.43
        # 15: 2019-04-01      6 69.51
        # 16: 2019-04-01      7 74.52
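Note that setDT() and := modify df_input in place, so the original CLV and CLV_for columns are gone afterwards. A possible sketch that leaves the input untouched works on a copy made with as.data.table(). It assumes the rows are already ordered by months within each cohort; the names dt, start and cum_CLV are illustrative.

```r
library(data.table)

# Same data as in the question, written compactly.
df_input <- data.frame(
  cohort  = rep(c("2019-03-01", "2019-04-01"), times = c(9, 7)),
  months  = c(1:9, 1:7),
  CLV     = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA,
              22.2, 38.24, 46.08, 56.28, NA, NA, NA),
  CLV_for = c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95,
              35.75, 26.1, 16.09, 10.37, 7.15, 6.08, 5.01)
)

dt <- as.data.table(df_input)  # a copy, so df_input keeps all its columns

# Last observed CLV per cohort, stored in a temporary helper column.
dt[, start := CLV[max(which(!is.na(CLV)))], by = cohort]

# Observed rows keep CLV; missing rows get start + running sum of CLV_for.
dt[, cum_CLV := fifelse(is.na(CLV),
                        start + cumsum(CLV_for * is.na(CLV)),
                        CLV),
   by = cohort]
dt[, start := NULL]

dt[, .(cohort, months, cum_CLV)]
```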
        

        【Comments】:

          【Solution 4】:

          Another dplyr option could be:

          df_input %>%
           group_by(cohort) %>%
           transmute(months,
                     CLV = if_else(is.na(CLV), 
                                   last(na.omit(CLV)) + cumsum(CLV_for * is.na(CLV)),
                                   CLV))
          
             cohort     months   CLV
             <fct>       <dbl> <dbl>
           1 2019-03-01      1  59.9
           2 2019-03-01      2  61.1
           3 2019-03-01      3  62.1
           4 2019-03-01      4  62.6
           5 2019-03-01      5  62.8
           6 2019-03-01      6  64.0
           7 2019-03-01      7  65.2
           8 2019-03-01      8  66.4
           9 2019-03-01      9  67.3
          10 2019-04-01      1  22.2
          11 2019-04-01      2  38.2
          12 2019-04-01      3  46.1
          13 2019-04-01      4  56.3
          14 2019-04-01      5  63.4
          15 2019-04-01      6  69.5
          16 2019-04-01      7  74.5
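The question's second step also renames the columns to sub_date, months_after_acquisition and cum_CLV. One way to get that layout in the same pipeline is sketched below, using select() to rename; it uses CLV[!is.na(CLV)] instead of na.omit() purely as a stylistic choice.

```r
library(dplyr)

# Same data as in the question, written compactly.
df_input <- data.frame(
  cohort  = rep(c("2019-03-01", "2019-04-01"), times = c(9, 7)),
  months  = c(1:9, 1:7),
  CLV     = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA,
              22.2, 38.24, 46.08, 56.28, NA, NA, NA),
  CLV_for = c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95,
              35.75, 26.1, 16.09, 10.37, 7.15, 6.08, 5.01)
)

df_final <- df_input %>%
  group_by(cohort) %>%
  arrange(months, .by_group = TRUE) %>%
  mutate(cum_CLV = if_else(is.na(CLV),
                           last(CLV[!is.na(CLV)]) + cumsum(CLV_for * is.na(CLV)),
                           CLV)) %>%
  ungroup() %>%
  # keep only the three target columns, renaming two of them on the way
  select(sub_date = cohort, months_after_acquisition = months, cum_CLV)

df_final
```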
          

          【Comments】:

            【Solution 5】:

            Using purrr::accumulate2():

            library(purrr)
            library(dplyr)
            
            df_input %>%
              group_by(cohort) %>%
              mutate(CLV = flatten_dbl(accumulate2(CLV, CLV_for[-1], .f = ~ if(!is.na(..2)) ..2 else ..1 + ..3))) %>%
              select(-CLV_for)
            
            # A tibble: 16 x 3
            # Groups:   cohort [2]
               cohort     months   CLV
               <chr>       <dbl> <dbl>
             1 2019-03-01      1  59.9
             2 2019-03-01      2  61.1
             3 2019-03-01      3  62.1
             4 2019-03-01      4  62.6
             5 2019-03-01      5  62.8
             6 2019-03-01      6  64.0
             7 2019-03-01      7  65.2
             8 2019-03-01      8  66.4
             9 2019-03-01      9  67.3
            10 2019-04-01      1  22.2
            11 2019-04-01      2  38.2
            12 2019-04-01      3  46.1
            13 2019-04-01      4  56.3
            14 2019-04-01      5  63.4
            15 2019-04-01      6  69.5
            16 2019-04-01      7  74.5
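For readers unfamiliar with accumulate2(), the same recurrence can be written as an explicit loop: keep a CLV value when it is observed, otherwise add that month's CLV_for to the previous result. The function name running_clv is hypothetical and the sketch handles one cohort at a time.

```r
# Explicit-loop equivalent of the accumulate2() call, for a single cohort.
running_clv <- function(clv, clv_for) {
  out <- clv
  for (i in seq_along(out)[-1]) {
    # Observed values stay as they are; a missing value becomes
    # the previous result plus this month's CLV_for.
    if (is.na(out[i])) out[i] <- out[i - 1] + clv_for[i]
  }
  out
}

# 2019-03-01 cohort, values copied from the question.
running_clv(
  c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA),
  c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95)
)
# 59.90 61.10 62.06 62.58 62.83 64.01 65.19 66.37 67.32
```

Inside a grouped mutate() this could be called as running_clv(CLV, CLV_for), assuming the rows are sorted by months within each cohort.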
            

            【Comments】:
