【Posted】:2021-12-10 12:44:48
【Question】:
Suppose I have two tables (DT_sportA and DT_sportB) recording the periods during which two children (id) took part in sports "A" and "B".
library(data.table)
library(lubridate)
DT_sportA <- data.table(id = rep(1:2,each=2),
start_date = ymd(c("2000-01-01","2002-01-15","2014-03-12","2016-10-14")),
end_date = ymd(c("2000-02-03","2003-03-01","2014-04-03","2017-05-19")))
DT_sportA
# id start_date end_date
# 1: 1 2000-01-01 2000-02-03
# 2: 1 2002-01-15 2003-03-01
# 3: 2 2014-03-12 2014-04-03
# 4: 2 2016-10-14 2017-05-19
DT_sportB <- data.table(id = c(1L,1L,2L),
start_date = ymd(c("2000-01-15","2002-01-15","2017-02-10")),
end_date = ymd(c("2000-02-01","2006-03-19","2017-02-20")))
DT_sportB
# id start_date end_date
# 1: 1 2000-01-15 2000-02-01
# 2: 1 2002-01-15 2006-03-19
# 3: 2 2017-02-10 2017-02-20
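I would like to produce a new table containing every distinct date range, split wherever the periods overlap, with two indicator columns showing which sports the child took part in during each range. The desired DT should look like this: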
id start_date end_date sportA sportB
1: 1 2000-01-01 2000-01-14 1 0
2: 1 2000-01-15 2000-02-01 1 1
3: 1 2000-02-02 2000-02-03 1 0
4: 1 2002-01-15 2003-03-01 1 1
5: 1 2003-03-02 2006-03-19 0 1
6: 2 2014-03-12 2014-04-03 1 0
7: 2 2016-10-14 2017-02-09 1 0
8: 2 2017-02-10 2017-02-20 1 1
9: 2 2017-02-21 2017-05-19 1 0
This is a fairly simple toy example. The real data span millions of rows and about 20 "sports", which is why I am looking for a data.table solution.
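For illustration, here is a minimal sketch of one way this could be approached; it is not a benchmarked answer for the full-size data. The idea: stack the sport tables in long form, treat every start_date and every end_date + 1 as a breakpoint, turn consecutive breakpoints into elementary intervals, then use a non-equi join to find which sports cover each interval. The names long, cuts, segs, hits and out are made up for this example, and the sketch assumes that end_date is inclusive and that a child has no overlapping rows within the same sport (otherwise the indicator becomes a count greater than 1).

```r
library(data.table)

# Same toy data as above, built with base as.Date() so the block is self-contained
DT_sportA <- data.table(
  id         = rep(1:2, each = 2),
  start_date = as.Date(c("2000-01-01", "2002-01-15", "2014-03-12", "2016-10-14")),
  end_date   = as.Date(c("2000-02-03", "2003-03-01", "2014-04-03", "2017-05-19"))
)
DT_sportB <- data.table(
  id         = c(1L, 1L, 2L),
  start_date = as.Date(c("2000-01-15", "2002-01-15", "2017-02-10")),
  end_date   = as.Date(c("2000-02-01", "2006-03-19", "2017-02-20"))
)

# 1. Stack all sports in long form; the list names become the "sport" column
long <- rbindlist(list(sportA = DT_sportA, sportB = DT_sportB), idcol = "sport")

# 2. Breakpoints per child: every start date and the day after every end date
cuts <- unique(rbind(
  long[, .(id, date = start_date)],
  long[, .(id, date = end_date + 1L)]
))
setorder(cuts, id, date)

# 3. Consecutive breakpoints delimit elementary intervals (end is inclusive)
segs <- cuts[, .(start_date = date[-.N], end_date = date[-1L] - 1L), by = id]

# 4. Non-equi join: keep each interval once per sport period that fully covers it.
#    Intervals covered by no sport (gaps) are dropped by nomatch = NULL.
hits <- long[segs,
             on = .(id, start_date <= start_date, end_date >= end_date),
             .(id, start_date = i.start_date, end_date = i.end_date, sport = x.sport),
             nomatch = NULL]

# 5. One indicator column per sport
out <- dcast(hits, id + start_date + end_date ~ sport,
             fun.aggregate = length, value.var = "sport", fill = 0L)
out
```

With more sports, only the list passed to rbindlist() in step 1 should need to grow; the remaining steps do not refer to individual sports. How well the non-equi join scales to millions of rows is something I have not measured.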
【Discussion】:
Tags: r data.table