[Posted]: 2021-04-07 04:28:32
[Question]:
I have a data frame reporting contract start and end dates, which looks like this:
df <- structure(list(dyadID = c(2, 3, 4, 2, 2, 5, 5, 1, 13765, 13765, 13765, 13765, 43164, 43164, 43164),
employeesID = c("Alf", "Alf","Alf", "Alf", "Alf", "Alf", "Alf", "Alf", "Bet", "Bet", "Bet", "Bet", "Gam", "Gam", "Gam"),
employersID = c("31974", "32009", "32040", "31974", "31974", "358291", "358291", "31665", "31345", "31345", "31345", "31345", "363109", "363109", "363109"),
start_date = structure(c(15613, 15863, 15937, 16295, 16299, 17037, 17045, 17136, 15692, 16097, 16141, 16513, 17116, 17554, 17913), class = "Date"),
end_date = structure(c(15862, 15937, 16295, 16297, 17036, 17044, 17136, NA, 16067, 16141, 16505, NA, 17543, 17907, 18272), class = "Date")),
row.names = c(NA,-15L), class = c("data.table", "data.frame"))
dyadID employeesID employersID start_date end_date
1: 2 Alf 31974 2012-09-30 2013-06-06
2: 3 Alf 32009 2013-06-07 2013-08-20
3: 4 Alf 32040 2013-08-20 2014-08-13
4: 2 Alf 31974 2014-08-13 2014-08-15
5: 2 Alf 31974 2014-08-17 2016-08-23
6: 5 Alf 358291 2016-08-24 2016-08-31
7: 5 Alf 358291 2016-09-01 2016-12-01
8: 1 Alf 31665 2016-12-01 <NA>
9: 13765 Bet 31345 2012-12-18 2013-12-28
10: 13765 Bet 31345 2014-01-27 2014-03-12
11: 13765 Bet 31345 2014-03-12 2015-03-11
12: 13765 Bet 31345 2015-03-19 <NA>
13: 43164 Gam 363109 2016-11-11 2018-01-12
14: 43164 Gam 363109 2018-01-23 2019-01-11
15: 43164 Gam 363109 2019-01-17 2020-01-11
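Employees sign multiple contracts over time.
For example, the first row shows that Alf signed a contract with employersID==31974 on 2012-09-30 and that the contract ended on 2013-06-06. The second row shows that Alf signed a new contract with employersID==32009 on 2013-06-07.
Sometimes the same employee signs two consecutive contracts with the same employer (e.g. rows 4 and 5). Sometimes there are three or even four consecutive contracts (up to 9 in the real data), e.g. rows 13-15 and rows 9-12.
I would like to collapse these observations, in which an employee signs several consecutive contracts with the same employer, into a single row that reports the start_date of the first contract and the end_date of the last contract in that relationship.
The final dataset should look like this: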
dyadID employeesID employersID start_date end_date
1: 2 Alf 31974 2012-09-30 2013-06-06
2: 3 Alf 32009 2013-06-07 2013-08-20
3: 4 Alf 32040 2013-08-20 2014-08-13
5: 2 Alf 31974 2014-08-13 2016-08-23 # collapsed observation (one time), keeping start_date of the first collapsed observation and end_date of the last collapsed observation
6: 5 Alf 358291 2016-08-24 2016-12-01 # collapsed one time
7: 1 Alf 31665 2016-12-01 <NA>
8: 13765 Bet 31345 2012-12-18 <NA> # collapsed observation (3 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation
13: 43164 Gam 363109 2016-11-11 2020-01-11 # collapsed observation (2 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation
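To this end I tried the following, but it does not seem very straightforward, and it does not work when the dates need to be adjusted more than once: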
library(data.table)
library(dplyr)  # lag() and if_else() below come from dplyr

# Step 1: flag the observations that need to be collapsed
df <- setDT(df)[order(employeesID, start_date),
                same_dyd := ifelse(dyadID == lag(dyadID), 1, 0),
                by = .(employeesID)
][is.na(same_dyd), same_dyd := 0
# Step 2: new variable with the correct date when there is only one new contract
][order(employeesID, start_date),
  new_start_date := if_else(same_dyd == 1, lag(start_date), start_date)]
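However, this approach is inefficient, it does not actually collapse the rows, and the new_start_date variable is wrong when more than one collapse is needed.
Does anyone have a suggestion for solving this problem?
Thank you very much for your help!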
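For reference, the collapse described above can be sketched with a run-length id rather than a single lag. This is only a minimal sketch, not a confirmed answer: it assumes that "consecutive" means adjacent rows with the same dyadID for the same employee once the data are sorted by start_date, and that employersID is constant within a dyadID. The names spell and collapsed are made up for illustration.

```r
library(data.table)

# Same values as the question's data, rebuilt with data.table() directly.
df <- data.table(
  dyadID      = c(2, 3, 4, 2, 2, 5, 5, 1, 13765, 13765, 13765, 13765,
                  43164, 43164, 43164),
  employeesID = c(rep("Alf", 8), rep("Bet", 4), rep("Gam", 3)),
  employersID = c("31974", "32009", "32040", "31974", "31974", "358291",
                  "358291", "31665", rep("31345", 4), rep("363109", 3)),
  start_date  = as.Date(c(15613, 15863, 15937, 16295, 16299, 17037, 17045,
                          17136, 15692, 16097, 16141, 16513, 17116, 17554,
                          17913), origin = "1970-01-01"),
  end_date    = as.Date(c(15862, 15937, 16295, 16297, 17036, 17044, 17136,
                          NA, 16067, 16141, 16505, NA, 17543, 17907, 18272),
                        origin = "1970-01-01")
)

# Sort so that each employee's contracts are in chronological order.
setorder(df, employeesID, start_date)

# rleid() gives every run of identical (employeesID, dyadID) its own id,
# so a run of any length (2, 3, ... 9 contracts) ends up in one group.
# `spell` is a hypothetical helper column.
df[, spell := rleid(employeesID, dyadID)]

# Keep the first start_date and the last end_date of each run.
# Assumption: employersID does not vary within a run.
collapsed <- df[, .(start_date = start_date[1L], end_date = end_date[.N]),
                by = .(spell, dyadID, employeesID, employersID)]
collapsed[, spell := NULL]
```

On the sample data this yields eight rows, with an NA end_date wherever the last contract of a run is still open. Note that rows 4 and 5 are separated by a two-day gap yet are merged in the desired output, so the sketch deliberately ignores gaps between contracts; if a gap should start a new spell, the grouping id would need an extra condition on the dates.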
[Comments]:
- Yes, I edited this.
Tags: r data.table data-manipulation