[Posted]: 2020-10-02 07:41:39
[Question]:
My initial data frame looks like this:
library(tidyverse)

df_input <- data.frame(
  cohort = c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
             "2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
             "2019-03-01", "2019-04-01", "2019-04-01", "2019-04-01",
             "2019-04-01", "2019-04-01", "2019-04-01", "2019-04-01"),
  months = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
  CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA, 22.2, 38.24,
          46.08, 56.28, NA, NA, NA),
  CLV_for = c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95, 35.75,
              26.1, 16.09, 10.37, 7.15, 6.08, 5.01)
)
       cohort months   CLV CLV_for
1  2019-03-01      1 59.90    1.66
2  2019-03-01      2 61.10    1.42
3  2019-03-01      3 62.06    1.42
4  2019-03-01      4 62.58    1.42
5  2019-03-01      5 62.83    1.18
6  2019-03-01      6    NA    1.18
7  2019-03-01      7    NA    1.18
8  2019-03-01      8    NA    1.18
9  2019-03-01      9    NA    0.95
10 2019-04-01      1 22.20   35.75
11 2019-04-01      2 38.24   26.10
12 2019-04-01      3 46.08   16.09
13 2019-04-01      4 56.28   10.37
14 2019-04-01      5    NA    7.15
15 2019-04-01      6    NA    6.08
16 2019-04-01      7    NA    5.01
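Starting from the last non-NA value in the CLV column within each group (i.e. each cohort), I want to compute a cumulative sum (using cumsum() in a dplyr pipeline) that continues with the values in the CLV_for column.
To explain the calculation more clearly, I will split it into two steps.
1) Starting from the last non-NA value in the CLV column for cohort 2019-03-01, add the cumulative sum (cumsum()) of the corresponding values in the CLV_for column. The same applies to cohort 2019-04-01.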
df_inter <- data.frame(
  cohort = c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
             "2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
             "2019-03-01", "2019-04-01", "2019-04-01", "2019-04-01",
             "2019-04-01", "2019-04-01", "2019-04-01", "2019-04-01"),
  months = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
  CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA, 22.2, 38.24,
          46.08, 56.28, NA, NA, NA),
  cum_CLV_for = c(NA, NA, NA, NA, NA, 64.01, 65.19, 66.37, 67.32, NA,
                  NA, NA, NA, 63.43, 69.51, 74.51)
)
       cohort months   CLV cum_CLV_for
1  2019-03-01      1 59.90          NA
2  2019-03-01      2 61.10          NA
3  2019-03-01      3 62.06          NA
4  2019-03-01      4 62.58          NA
5  2019-03-01      5 62.83          NA
6  2019-03-01      6    NA       64.01 (<- 62.83 + 1.18)
7  2019-03-01      7    NA       65.19 (<- 64.01 + 1.18)
8  2019-03-01      8    NA       66.37 (<- 65.19 + 1.18)
9  2019-03-01      9    NA       67.32 (<- 66.37 + 0.95)
10 2019-04-01      1 22.20          NA
11 2019-04-01      2 38.24          NA
12 2019-04-01      3 46.08          NA
13 2019-04-01      4 56.28          NA
14 2019-04-01      5    NA       63.43 (<- 56.28 + 7.15)
15 2019-04-01      6    NA       69.51 (<- 63.43 + 6.08)
16 2019-04-01      7    NA       74.51 (<- 69.51 + 5.01)
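Step 1 could be sketched roughly as follows. This is only one possible approach, not a confirmed answer: it assumes the rows are already sorted by months within each cohort, that the NA values in CLV are all trailing, and that every cohort has at least one non-NA CLV. The names df_step1 and last_CLV are made up for illustration.

```r
library(dplyr)

# Same values as df_input in the question, written more compactly.
df_input <- data.frame(
  cohort  = rep(c("2019-03-01", "2019-04-01"), times = c(9, 7)),
  months  = c(1:9, 1:7),
  CLV     = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA,
              22.2, 38.24, 46.08, 56.28, NA, NA, NA),
  CLV_for = c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95,
              35.75, 26.1, 16.09, 10.37, 7.15, 6.08, 5.01)
)

df_step1 <- df_input %>%
  group_by(cohort) %>%
  mutate(
    # Last observed CLV in this cohort (hypothetical helper column).
    last_CLV = last(CLV[!is.na(CLV)]),
    # Add up CLV_for only over rows where CLV is missing, starting from
    # last_CLV; rows with an observed CLV stay NA.
    cum_CLV_for = if_else(
      is.na(CLV),
      last_CLV + cumsum(if_else(is.na(CLV), CLV_for, 0)),
      NA_real_
    )
  ) %>%
  ungroup() %>%
  select(cohort, months, CLV, cum_CLV_for)
```

Computed this way, the last row of cohort 2019-04-01 comes out as 74.52 (69.51 + 5.01), while the table above lists 74.51; the gap is presumably a rounding or typing slip in the expected output.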
2) In the second step, clean up by merging the two columns into a single column.
df_final <- data.frame(
  sub_date = c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
               "2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
               "2019-03-01", "2019-04-01", "2019-04-01", "2019-04-01",
               "2019-04-01", "2019-04-01", "2019-04-01", "2019-04-01"),
  months_after_acquisition = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
  cum_CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, 64.01, 65.19, 66.37, 67.32,
              22.2, 38.24, 46.08, 56.28, 63.43, 69.51, 74.51)
)
     sub_date months_after_acquisition cum_CLV
1  2019-03-01                        1   59.90
2  2019-03-01                        2   61.10
3  2019-03-01                        3   62.06
4  2019-03-01                        4   62.58
5  2019-03-01                        5   62.83
6  2019-03-01                        6   64.01
7  2019-03-01                        7   65.19
8  2019-03-01                        8   66.37
9  2019-03-01                        9   67.32
10 2019-04-01                        1   22.20
11 2019-04-01                        2   38.24
12 2019-04-01                        3   46.08
13 2019-04-01                        4   56.28
14 2019-04-01                        5   63.43
15 2019-04-01                        6   69.51
16 2019-04-01                        7   74.51
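Both steps might be combined into a single pipeline along these lines. Again, this is a sketch under the same assumptions (rows sorted by months within each cohort, only trailing NA values in CLV, at least one observed CLV per cohort); last_CLV and forecast are hypothetical helper columns, and coalesce() is used to take CLV where it exists and the running forecast otherwise.

```r
library(dplyr)

# Same values as df_input in the question, written more compactly.
df_input <- data.frame(
  cohort  = rep(c("2019-03-01", "2019-04-01"), times = c(9, 7)),
  months  = c(1:9, 1:7),
  CLV     = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA,
              22.2, 38.24, 46.08, 56.28, NA, NA, NA),
  CLV_for = c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95,
              35.75, 26.1, 16.09, 10.37, 7.15, 6.08, 5.01)
)

df_final <- df_input %>%
  group_by(cohort) %>%
  mutate(
    # Last observed CLV per cohort.
    last_CLV = last(CLV[!is.na(CLV)]),
    # Running total that only grows on rows where CLV is missing.
    forecast = last_CLV + cumsum(if_else(is.na(CLV), CLV_for, 0)),
    # Observed CLV where available, forecast otherwise.
    cum_CLV = coalesce(CLV, forecast)
  ) %>%
  ungroup() %>%
  # Rename to the column names used in the desired output.
  select(sub_date = cohort, months_after_acquisition = months, cum_CLV)
```

If the rows are not guaranteed to be ordered, an arrange(cohort, months) before group_by() would be a reasonable precaution, since cumsum() depends on row order.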
Thanks for your help!
[Discussion]: