Calculating retention rate per month, using start and end dates

【Question title】: Calculating retention rate per month, using start and end dates
【Posted】: 2021-09-26 09:21:00
【Question】:

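I have a dataset like the one below (but with tens of thousands of rows). It has a patient ID plus a treatment start date and end date. I need to calculate the monthly retention rate.

I define the retention rate as: (total patients at the end of the month - patients who started treatment during the month) / (total patients at the start of the month).

How would I do this in R, using e.g. dplyr?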

df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
                 treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
                                   "2019-04-01", "2020-04-01")),
                 treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31", 
                                 "2020-03-31", "2021-04-30")))

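【Comments】:

How do you derive the "patients who started treatment during the month" information from your data?
We know the date they entered treatment (treatment_start_date).
What is the expected output for the data shared here?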

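Tags: r dplyr data-manipulation data-cleaning lubridate

【Solution 1】:

If I understand correctly, you want to summarise the mean treatment duration per month. If not, please be more specific about the desired output. See the code below.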

library(dplyr)      # %>%, group_by(), summarise()
library(lubridate)  # floor_date()
options(digits=4)


df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
                 treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
                                                  "2019-04-01", "2020-04-01")),
                 treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31", 
                                                "2020-03-31", "2021-04-30")))


# treatment length in days; the columns are already Date, so no re-parsing is needed
df$days <- df$treatment_end_date - df$treatment_start_date

# mean treatment length, grouped by the month in which treatment started
df_per_month <- df %>%
  group_by(month = floor_date(treatment_start_date, "month")) %>%
  summarise(mean_month = mean(days))

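Is this what you had in mind? The mean treatment duration in days is calculated per treatment-start month.

【Comments】:

【Solution 2】:

So, I think I understand what you want:

the number of patients in treatment at the end of each month
the number of patients who started treatment in each month
the number of patients in treatment at the start of each month*

*Does this include patients who started treatment on the first day of the month (e.g. everyone in your sample data)? I have assumed it does in this example.

So, load your example data and make sure the dates are in proper date format: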

df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
                 treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
                                                  "2019-04-01", "2020-04-01")),
                 treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31", 
                                                "2020-03-31", "2021-04-30")))

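library(dplyr)      # %>%, mutate(), across(), group_by(), summarise(), left_join()
library(tidyr)      # unnest()
library(lubridate)  # ymd(), day()

# make sure the dates are date format
df %>% 
  as_tibble %>%
  mutate(across(treatment_start_date:treatment_end_date, ~ymd(.))) %>% 
  {. ->> df_1}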

df_1

# # A tibble: 6 x 3
# patient_ID treatment_start_date treatment_end_date
# <chr>      <date>               <date>            
# AA         2004-01-01           2014-12-31        
# BB         2007-01-01           2017-03-31        
# CC         2012-04-01           2018-03-31        
# DD         2014-04-01           2019-03-31        
# EE         2019-04-01           2020-03-31        
# FF         2020-04-01           2021-04-30

Then we build a sequence of every date on which each patient was in treatment (start and end dates included).

# make a sequence of every date a patient was in the treatment
df_1 %>% 
  rowwise %>% 
  mutate(
    treatment_days = list(seq(treatment_start_date, treatment_end_date, by = 'day'))
  ) %>% 
  select(patient_ID, treatment_days) %>% 
  unnest(cols = c('treatment_days')) %>% 
  {. ->> df_2}

df_2

# # A tibble: 12,539 x 2
# patient_ID treatment_days
# <chr>      <date>        
# AA         2004-01-01    
# AA         2004-01-02    
# AA         2004-01-03    
# AA         2004-01-04    
# AA         2004-01-05    
# AA         2004-01-06    
# AA         2004-01-07    
# AA         2004-01-08    
# AA         2004-01-09    
# AA         2004-01-10    
# # ... with 12,529 more rows
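
As a small illustration of what "start and end dates included" means for seq(), here is a sketch on a single hypothetical patient (the two dates are made up for the example and are not part of the question's data). It only uses base R.

```r
# Hypothetical one-patient example: treatment from 2020-01-30 to 2020-02-02.
start <- as.Date("2020-01-30")
end   <- as.Date("2020-02-02")

# seq() with by = "day" returns every date from start to end, both inclusive.
treatment_days <- seq(start, end, by = "day")
treatment_days

# Number of treatment days when both endpoints count.
n_days_expanded <- length(treatment_days)

# The same count without expanding the dates: difference in days plus one.
n_days_direct <- as.integer(end - start) + 1L
```

If the end date should not count as a day in treatment, the sequence would have to stop one day earlier, which changes the month-end totals for any patient whose treatment ends on the last day of a month.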

Then we work out how many patients were in treatment on each day, and keep only the first and last day of each month.

df_2 %>% 
  
  # work out how many patients were in treatment for each day
  group_by(treatment_days) %>% 
  summarise(
    n_patients = n_distinct(patient_ID)
  ) %>% 
  
  # make month column
  mutate(
    month = format(treatment_days, format = '%Y-%m')
  ) %>% 
  
  # keep only the first and last days of each month
  group_by(month) %>% 
  filter(
    day(treatment_days) == 1 | day(treatment_days) == max(day(treatment_days))
  ) %>% 
  
  # determine number of patients at the start and end of each month
  #    ensure the dates are in order
  arrange(month, treatment_days) %>% 
  group_by(month) %>% 
  summarise(
    n_patient_start = nth(n_patients, 1), 
    n_patient_end = nth(n_patients, 2), 
  ) %>% 
  
  {. ->> df_3}

df_3

# # A tibble: 208 x 3
# month   n_patient_start n_patient_end
# <chr>             <int>         <int>
# 2004-01               1             1
# 2004-02               1             1
# 2004-03               1             1
# 2004-04               1             1
# 2004-05               1             1
# 2004-06               1             1
# 2004-07               1             1
# 2004-08               1             1
# 2004-09               1             1
# 2004-10               1             1
# # ... with 198 more rows

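So now we have the total number of patients in treatment at the start and end of each month.

Before calculating the retention rate, we need to know how many patients started treatment in each month, so that it can be used in the calculation.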
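
With tens of thousands of patients, expanding every treatment period into daily rows can get large. As a possible alternative (not the approach used in this answer), the same month-start and month-end totals could be sketched in base R by testing each patient's interval against the first and last calendar day of every month. The names first_month, last_month, month_start, month_end, n_active_on and counts are introduced here for illustration only. Note that this counts on calendar days, whereas the code above keeps the first and last days present in the data; the two agree for the sample data because it has no gaps.

```r
df <- data.frame(patient_ID = c("AA", "BB", "CC", "DD", "EE", "FF"),
                 treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01",
                                                  "2014-04-01", "2019-04-01", "2020-04-01")),
                 treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31",
                                                "2019-03-31", "2020-03-31", "2021-04-30")))

# First day of every calendar month spanned by the data.
first_month <- as.Date(format(min(df$treatment_start_date), "%Y-%m-01"))
last_month  <- as.Date(format(max(df$treatment_end_date),   "%Y-%m-01"))
month_start <- seq(first_month, last_month, by = "month")

# Last day of a month = first day of the following month minus one day.
month_end <- seq(first_month, by = "month",
                 length.out = length(month_start) + 1)[-1] - 1

# A patient counts as in treatment on `day` if start <= day <= end (both inclusive).
n_active_on <- function(day) {
  sum(df$treatment_start_date <= day & df$treatment_end_date >= day)
}

counts <- data.frame(
  month           = format(month_start, "%Y-%m"),
  n_patient_start = vapply(month_start, n_active_on, integer(1)),
  n_patient_end   = vapply(month_end,   n_active_on, integer(1))
)

head(counts)
```

This loops over months rather than days, so the work grows with (number of months) x (number of patients) instead of the total number of treatment days.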

# how many patients started each month?
df_1 %>% 
  select(patient_ID, treatment_start_date) %>% 
  mutate(
    month = format(treatment_start_date, format = '%Y-%m')
  ) %>% 
  group_by(month) %>% 
  summarise(
    n_starting_patients = n_distinct(patient_ID)
  ) %>% 
  {. ->> n_new_per_month}

n_new_per_month

# # A tibble: 6 x 2
# month   n_starting_patients
# <chr>                 <int>
# 2004-01                   1
# 2007-01                   1
# 2012-04                   1
# 2014-04                   1
# 2019-04                   1
# 2020-04                   1

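We join the number of starting patients per month to the number of active patients at the start and end of each month. Then we can calculate the retention rate using the formula from your question.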

# now, we join in new patients per month
df_3 %>% 
  left_join(n_new_per_month, by = 'month') %>% 
  mutate(
    n_starting_patients = ifelse(is.na(n_starting_patients), 0, n_starting_patients)
  ) %>% 
  
  # calculate retention rate
  mutate(
    ret_rate = (n_patient_end - n_starting_patients) / n_patient_start
  )

# # A tibble: 208 x 5
# month   n_patient_start n_patient_end n_starting_patients ret_rate
# <chr>             <int>         <int>               <dbl>    <dbl>
# 2004-01               1             1                   1        0
# 2004-02               1             1                   0        1
# 2004-03               1             1                   0        1
# 2004-04               1             1                   0        1
# 2004-05               1             1                   0        1
# 2004-06               1             1                   0        1
# 2004-07               1             1                   0        1
# 2004-08               1             1                   0        1
# 2004-09               1             1                   0        1
# 2004-10               1             1                   0        1
# # ... with 198 more rows

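Now, I'm not entirely sure this is right, because, as in the preview above, the retention rate for January 2004 is 0 even though no patients were lost that month (in fact, one started treatment). This is because patient AA started on 1 January, so the retention rate is calculated as (number of patients at the end of the month - number of patients that started in that month) / number of patients at the start of the month, or (1 - 1) / 1 = 0 / 1 = 0.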
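
The arithmetic can be checked with a small helper that wraps the formula from the question. The function name retention_rate and its argument names are made up for this sketch; the input values are taken from the first two rows of the table above.

```r
# Formula from the question:
# (patients at month end - patients who started in the month) / patients at month start
retention_rate <- function(n_end, n_new, n_start) {
  (n_end - n_new) / n_start
}

# January 2004: AA starts on the 1st, so AA is counted both as "new" and as
# present at the start of the month: (1 - 1) / 1
rate_2004_01 <- retention_rate(n_end = 1, n_new = 1, n_start = 1)

# February 2004: AA is still in treatment and nobody new starts: (1 - 0) / 1
rate_2004_02 <- retention_rate(n_end = 1, n_new = 0, n_start = 1)

c(rate_2004_01, rate_2004_02)
```

The two results match the ret_rate column in the table: 0 for 2004-01 and 1 for 2004-02.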

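Things that affect the retention rate under the current formula:

Are the start and end dates included in the dates a patient was in treatment?
If a patient starts on the first day of the month, does that mean they are included in, or removed from, the "patients at the start of the month"? I could understand wanting to remove the number of patients who started partway through the month, but with the current formula that doesn't quite make sense to me.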
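
To make the second point concrete, here is a sketch comparing two conventions for April 2012 in the sample data, when CC starts on the 1st while AA and BB are already in treatment. The "exclusive" convention (counting only patients who began before the 1st as present at the start of the month) is an assumption introduced for illustration, not something the question specifies, and all variable names are made up for the example.

```r
df <- data.frame(patient_ID = c("AA", "BB", "CC", "DD", "EE", "FF"),
                 treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01",
                                                  "2014-04-01", "2019-04-01", "2020-04-01")),
                 treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31",
                                                "2019-03-31", "2020-03-31", "2021-04-30")))

month_start <- as.Date("2012-04-01")
month_end   <- as.Date("2012-04-30")

# Patients in treatment on the last day of the month.
active_end <- sum(df$treatment_start_date <= month_end &
                  df$treatment_end_date   >= month_end)

# Patients whose treatment started during the month.
new_in_month <- sum(df$treatment_start_date >= month_start &
                    df$treatment_start_date <= month_end)

# Convention used in this answer: anyone in treatment on the 1st counts
# toward the month-start total, including patients who start that day.
start_incl <- sum(df$treatment_start_date <= month_start &
                  df$treatment_end_date   >= month_start)

# Alternative convention (an assumption): only patients who began before
# the 1st count toward the month-start total.
start_excl <- sum(df$treatment_start_date <  month_start &
                  df$treatment_end_date   >= month_start)

rate_incl <- (active_end - new_in_month) / start_incl  # (3 - 1) / 3
rate_excl <- (active_end - new_in_month) / start_excl  # (3 - 1) / 2
c(rate_incl, rate_excl)
```

Nobody leaves treatment in April 2012, yet the inclusive convention reports about 0.67 while the exclusive one reports 1. Under the exclusive convention a month with no carried-over patients (such as January 2004) has a zero denominator, so its rate would be undefined and would need separate handling.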

【Comments】: