【Question title】: R - Calculating days of overlap among date intervals
【Posted】: 2021-09-16 15:24:06
【Question】:

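I have the start and end dates of different products for a large number of users. The purchase intervals of different products may overlap, or there may be gaps between them: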

user_id start_date  end_date    product
    12  31/10/2010  31/10/2011  A
    12  18/12/2010  18/12/2011  A
    12  31/10/2011  28/04/2014  B
    12  18/12/2011  18/12/2014  A
    12  27/03/2014  27/03/2015  A
    12  18/12/2014  18/12/2016  B
    12  27/03/2015  27/03/2016  B
    12  18/12/2016  18/12/2017  D
    33  01/07/1992  01/07/2016  A
    33  20/08/1993  16/08/2016  B
    33  28/10/1999  15/11/2012  A
    33  31/01/2006  28/02/2006  B
    33  26/08/2016  26/01/2017  C

I would like to get, for each user, the number of overlapping days for every possible combination of products.

user_id A_B       A_C   A_D      B_C    B_D      C_D
12      20 days 0 days  10 days 0 days  0 days  0 days
33      10 days 0 days  0 days  0 days  20 days 20 days
                    

Is there a quick and elegant way to code this, preferably in dplyr?

Thanks for your help!

Code:

library(lubridate)
library(Hmisc)
library(dplyr)

user_id <- c(rep(12, 8), rep(33, 5))

start_date <- dmy(Cs(31/10/2010, 18/12/2010, 31/10/2011, 18/12/2011, 27/03/2014, 18/12/2014, 27/03/2015, 18/12/2016,
                     01/07/1992, 20/08/1993, 28/10/1999, 31/01/2006, 26/08/2016))

end_date <- dmy(Cs(31/10/2011, 18/12/2011, 28/04/2014, 18/12/2014, 27/03/2015, 18/12/2016, 27/03/2016, 18/12/2017,
                   01/07/2016, 16/08/2016, 15/11/2012, 28/02/2006, 26/01/2017))

product <- c("A", "A", "B", "A", "A", "B", "B", "D", "A", "B", "A", "B", "C")

data <- data.frame(user_id, start_date, end_date, product)

【Comments】:

Tags: r


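【Solution 1】:

Here is a solution. First, we create the data table. Note that I modified your data slightly: product D for user 12 now starts on 18/01/2016 instead of 18/12/2016, and product C for user 33 starts on 26/08/2006 instead of 26/08/2016, so that those products actually overlap with the others.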

library(tidyverse)
library(lubridate)

df = read.table(
  header = TRUE,text="
user_id start_date  end_date    product
    12  31/10/2010  31/10/2011  A
    12  18/12/2010  18/12/2011  A
    12  31/10/2011  28/04/2014  B
    12  18/12/2011  18/12/2014  A
    12  27/03/2014  27/03/2015  A
    12  18/12/2014  18/12/2016  B
    12  27/03/2015  27/03/2016  B
    12  18/01/2016  18/12/2017  D
    33  01/07/1992  01/07/2016  A
    33  20/08/1993  16/08/2016  B
    33  28/10/1999  15/11/2012  A
    33  31/01/2006  28/02/2006  B
    33  26/08/2006  26/01/2017  C
") %>% as_tibble()

Next, we parse the dates and add a time interval column to the data.

df1 = df %>% mutate(
  start_date = start_date %>% dmy(),
  end_date = end_date %>% dmy(),
  product = product %>% fct_infreq(),
  dateint = interval(start_date, end_date)
)

Output

# A tibble: 13 x 5
   user_id start_date end_date   product dateint                       
     <int> <date>     <date>     <fct>   <Interval>                    
 1      12 2010-10-31 2011-10-31 A       2010-10-31 UTC--2011-10-31 UTC
 2      12 2010-12-18 2011-12-18 A       2010-12-18 UTC--2011-12-18 UTC
 3      12 2011-10-31 2014-04-28 B       2011-10-31 UTC--2014-04-28 UTC
 4      12 2011-12-18 2014-12-18 A       2011-12-18 UTC--2014-12-18 UTC
 5      12 2014-03-27 2015-03-27 A       2014-03-27 UTC--2015-03-27 UTC
 6      12 2014-12-18 2016-12-18 B       2014-12-18 UTC--2016-12-18 UTC
 7      12 2015-03-27 2016-03-27 B       2015-03-27 UTC--2016-03-27 UTC
 8      12 2016-01-18 2017-12-18 D       2016-01-18 UTC--2017-12-18 UTC
 9      33 1992-07-01 2016-07-01 A       1992-07-01 UTC--2016-07-01 UTC
10      33 1993-08-20 2016-08-16 B       1993-08-20 UTC--2016-08-16 UTC
11      33 1999-10-28 2012-11-15 A       1999-10-28 UTC--2012-11-15 UTC
12      33 2006-01-31 2006-02-28 B       2006-01-31 UTC--2006-02-28 UTC
13      33 2006-08-26 2017-01-26 C       2006-08-26 UTC--2017-01-26 UTC

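Now let's create three simple helper functions. fDayInt returns the number of days in the common part of two time intervals. fSumDayInt returns the sum of the overlapping days between the intervals of the two products given as arguments. fSumComb returns these sums for all product combinations.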

fDayInt = function(int1, int2) intersect(int1, int2) %>% 
  as.numeric(.)/(60*60*24)

fSumDayInt = function(df, product1, product2){
  df1A = df %>% 
    filter(product == product1) %>% 
    select(dateint) %>% 
    mutate(join = 1)
  df2B = df %>% 
    filter(product == product2) %>% 
    select(dateint) %>% 
    mutate(join = 1)
  df1A %>% left_join(df2B, by="join") %>% 
    mutate(nday = fDayInt(dateint.x, dateint.y)) %>% 
    summarise(sum.day = sum(nday, na.rm=TRUE)) %>% pull(sum.day)
}

fSumComb = function(df) tibble(
  A_B = df %>% fSumDayInt("A", "B"),
  A_C = df %>% fSumDayInt("A", "C"),
  A_D = df %>% fSumDayInt("A", "D"),
  B_C = df %>% fSumDayInt("B", "C"),
  B_D = df %>% fSumDayInt("B", "D"),
  C_D = df %>% fSumDayInt("C", "D")
)
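
For readers who want to check the numbers by hand, the idea behind fDayInt can be sketched in base R without lubridate. This is only an illustration under the assumption that the dates are plain Date values; the name overlap_days is hypothetical and does not appear in the answer above. The overlap of two ranges runs from the later start to the earlier end, and is zero when that span is empty.

```r
# Hypothetical helper (not part of the answer above): days shared by two
# date ranges, computed as (earlier end) - (later start), floored at 0.
overlap_days <- function(start1, end1, start2, end2) {
  s <- pmax(as.numeric(start1), as.numeric(start2))  # later of the two starts
  e <- pmin(as.numeric(end1), as.numeric(end2))      # earlier of the two ends
  pmax(0, e - s)                                     # 0 when the ranges are disjoint
}

# Rows 4 (product A) and 3 (product B) of user 12
a_b <- overlap_days(as.Date("2011-12-18"), as.Date("2014-12-18"),
                    as.Date("2011-10-31"), as.Date("2014-04-28"))
a_b  # 862 days, from 2011-12-18 to 2014-04-28

# Two ranges that never meet
none <- overlap_days(as.Date("2010-01-01"), as.Date("2010-02-01"),
                     as.Date("2011-01-01"), as.Date("2011-02-01"))
none  # 0
```

Because pmax and pmin are vectorised, the same function also works on whole columns of start and end dates.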

The last step is child's play!

df1 %>% group_by(user_id) %>% 
  group_modify(~fSumComb(.x))

Output

# A tibble: 2 x 7
# Groups:   user_id [2]
  user_id   A_B   A_C   A_D   B_C   B_D   C_D
    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      12  1041     0     0     0   404     0
2      33 13174  5870     0  3643     0     0
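
fSumComb spells out the six product pairs by hand. As a rough sketch of how the pairs could instead be derived from the data, the base-R code below builds them with combn() and sums the pairwise overlaps per user. The names overlap_days, pair_overlaps and dat are hypothetical and introduced only for this illustration; the data are the modified values used in this answer, and the approach assumes plain Date columns.

```r
# Hypothetical helper: days shared by two date ranges (0 if disjoint).
overlap_days <- function(start1, end1, start2, end2) {
  s <- pmax(as.numeric(start1), as.numeric(start2))
  e <- pmin(as.numeric(end1), as.numeric(end2))
  pmax(0, e - s)
}

# Hypothetical helper: summed pairwise overlap for every product pair of ONE user.
pair_overlaps <- function(d, products) {
  pairs <- combn(products, 2, simplify = FALSE)   # all unordered product pairs
  days <- vapply(pairs, function(p) {
    x <- d[d$product == p[1], ]
    y <- d[d$product == p[2], ]
    if (nrow(x) == 0 || nrow(y) == 0) return(0)   # user never had one of the two
    # every interval of the first product against every interval of the second
    idx <- expand.grid(i = seq_len(nrow(x)), j = seq_len(nrow(y)))
    sum(overlap_days(x$start_date[idx$i], x$end_date[idx$i],
                     y$start_date[idx$j], y$end_date[idx$j]))
  }, numeric(1))
  names(days) <- vapply(pairs, paste, character(1), collapse = "_")
  days
}

# The modified data from this answer, with dates already parsed
dat <- data.frame(
  user_id = c(rep(12, 8), rep(33, 5)),
  start_date = as.Date(c("2010-10-31", "2010-12-18", "2011-10-31", "2011-12-18",
                         "2014-03-27", "2014-12-18", "2015-03-27", "2016-01-18",
                         "1992-07-01", "1993-08-20", "1999-10-28", "2006-01-31",
                         "2006-08-26")),
  end_date = as.Date(c("2011-10-31", "2011-12-18", "2014-04-28", "2014-12-18",
                       "2015-03-27", "2016-12-18", "2016-03-27", "2017-12-18",
                       "2016-07-01", "2016-08-16", "2012-11-15", "2006-02-28",
                       "2017-01-26")),
  product = c("A", "A", "B", "A", "A", "B", "B", "D", "A", "B", "A", "B", "C"),
  stringsAsFactors = FALSE
)

products <- sort(unique(dat$product))
# One row per user, one column per product pair
res <- do.call(rbind, lapply(split(dat, dat$user_id), pair_overlaps,
                             products = products))
res
```

Like the helpers above, this sums the overlap of every pair of intervals, so a calendar day is counted more than once when a user's intervals for the same product overlap each other (as the two A intervals of user 33 do).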

I hope this is what you expected.

【Comments】:

  • Thank you very much, Marek, this is really helpful!