Calculate retention rate split by year: answers

【Question title】: Calculate retention rate split by year
【Posted】: 2020-01-09 01:13:08
【Question description】:

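Calculating the retention/churn rate by year

Dear community, I am working on a data mining project and want to move my earlier approach from Excel to R.

I have a customer database containing contract data and want to calculate the retention rate. I have been experimenting with library(lubridate); library(reshape2); library(plyr), but I can't work out how to do this in R.

My data looks like this: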

ID    Customer        START          END
 1       Tesco   01-01-2000   31-12-2000
 2       Apple   05-11-2001   06-02-2002
 3         H&M   01-02-2002   08-05-2002
 4       Tesco   01-01-2001   31-12-2001
 5       Apple   01-01-2003   31-12-2004

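I am now thinking of splitting the data by year (df2000, df2001) and then checking, for each year, whether the customer name appears in the main table (returning 1 if it does).

The result could look like this: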

Customer     2000    2001    2002  2003   Retention Rate
Tesco         1        1      0     0          0.5
Apple         0        1      0     1
H&M           0        0      1     0

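【Question discussion】:

Tags: r database split retention churn

【Solution 1】:

Using dplyr, you can extract the year from each START date, count the number of entries for each Customer and year, calculate the retention rate, and spread the data to wide format.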

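library(dplyr)
df %>%
  # extract the four-digit year from each START date
  mutate(year = format(as.Date(START, format = "%d-%m-%Y"), "%Y")) %>%
  # one row per Customer/year pair; n = number of contracts started that year
  dplyr::count(Customer, year) %>%
  group_by(Customer) %>%
  # years with a contract for this customer / distinct years in the whole data
  mutate(ret = n()/n_distinct(.$year))  %>%
  # one column per year; 0 where the customer has no contract
  tidyr::spread(year, n, fill = 0) 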

#  Customer   ret  `2000` `2001` `2002` `2003`
#  <fct>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#1 Apple     0.5       0      1      0      1
#2 H&M       0.25      0      0      1      0
#3 Tesco     0.5       1      1      0      0
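Since tidyr::spread() is superseded in current tidyr releases, the same steps can also be written with pivot_wider(). The following is only a sketch under a few assumptions: it rebuilds the question's sample data with character columns, the names yearly, total_years and result are made up for illustration, and names_sort = TRUE needs tidyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, as plain character columns (assumption)
df <- data.frame(
  ID = 1:5,
  Customer = c("Tesco", "Apple", "H&M", "Tesco", "Apple"),
  START = c("01-01-2000", "05-11-2001", "01-02-2002", "01-01-2001", "01-01-2003"),
  stringsAsFactors = FALSE
)

# One row per Customer/year pair
yearly <- df %>%
  mutate(year = format(as.Date(START, format = "%d-%m-%Y"), "%Y")) %>%
  count(Customer, year)

# Denominator: number of distinct years in the whole data set
total_years <- n_distinct(yearly$year)

result <- yearly %>%
  group_by(Customer) %>%
  mutate(ret = n() / total_years) %>%   # share of years with a contract
  ungroup() %>%
  pivot_wider(names_from = year, values_from = n,
              values_fill = list(n = 0L), names_sort = TRUE)

result
```

Computing total_years up front makes the denominator explicit instead of relying on .$year referring to the ungrouped input.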

Edit

To use fiscal years that run from October to September instead of calendar years, we can do the following:

library(lubridate)

df %>%
  mutate(START = dmy(START), 
         START = if_else(month(START) >= 10, START + years(1), START),
         year = year(START)) %>%
  dplyr::count(Customer, year) %>%
  group_by(Customer) %>%
  mutate(ret = n()/n_distinct(.$year))  %>%
  tidyr::spread(year, n, fill = 0)
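To see how the October-to-September rule maps individual dates, the shift can be isolated in a small helper. This is a sketch only: fiscal_year() and its start_month argument are hypothetical names, not part of lubridate, and it assumes that fiscal year N runs from 1 October of year N-1 to 30 September of year N.

```r
library(lubridate)

# Hypothetical helper: dates on or after start_month count towards the next year
fiscal_year <- function(date, start_month = 10) {
  year(date) + as.integer(month(date) >= start_month)
}

starts <- dmy(c("01-01-2000", "05-11-2001", "01-02-2002",
                "01-10-2004", "30-09-2005"))
fy <- fiscal_year(starts)
fy
# expected values: 2000 2002 2002 2005 2005
```

Both 01-10-2004 and 30-09-2005 land in fiscal year 2005, which matches the range given in the follow-up question.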

Data

df <- structure(list(ID = 1:5, Customer = structure(c(3L, 1L, 2L, 3L, 
1L), .Label = c("Apple", "H&M", "Tesco"), class = "factor"), 
START = structure(c(1L, 5L, 4L, 2L, 3L), .Label = c("01-01-2000", 
"01-01-2001", "01-01-2003", "01-02-2002", "05-11-2001"), class = "factor"), 
END = structure(c(3L, 1L, 2L, 4L, 5L), .Label = c("06-02-2002", 
"08-05-2002", "31-12-2000", "31-12-2001", "31-12-2004"), class = "factor")), 
class = "data.frame", row.names = c(NA, -5L))

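【Discussion】:

Hello Ronak Shah, thanks for the code! However, when I apply it to my real data, I get the error: Fehler in as.Date.default(start, format = "%m/%d/%y") : do not know how to convert 'start' to class "Date"
@Lebowski Are you using the correct column name? In your example it is START, whereas you are using start; case matters.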
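A quick way to rule out the column-name problem is to inspect names() before converting. The sketch below uses a made-up data frame (real_df, with a lower-case start column in month/day/two-digit-year format, as the format string in the error suggests); the actual data may look different.

```r
# Hypothetical stand-in for the real data (assumption)
real_df <- data.frame(
  Customer = c("Tesco", "Apple"),
  start = c("01/31/00", "11/05/01"),
  stringsAsFactors = FALSE
)

names(real_df)                            # check the exact spelling and case
stopifnot("start" %in% names(real_df))    # fail early if the column is missing

# Convert using the column name exactly as it appears in names()
real_df$start <- as.Date(real_df$start, format = "%m/%d/%y")
start_year <- format(real_df$start, "%Y")
start_year
```

One plausible reading of the error is that no column with that exact name exists, so R resolves start to the function stats::start, which as.Date() cannot convert.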
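You're a star!
Another question: instead of using the start year, I would like to use a standard fiscal year. For example, if a customer has a contract that falls within the range 1.10.2004 - 30.9.2005, return 1 for 2005.
@Lebowski Can you check the updated answer after the EDIT?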