Collapse multiple rows, keeping values of some rows in one variable and values of another row in another variable: answers

【Question title】: Collapse multiple rows, keeping values of some rows in one variable and values of another row in another variable
【Posted】: 2021-04-07 04:28:32
【Question description】:

I have a data frame that reports contract start and end dates. It looks like this:



df <- structure(list(dyadID = c(2, 3, 4, 2, 2, 5, 5, 1, 13765, 13765, 13765, 13765, 43164, 43164, 43164), 
                     employeesID = c("Alf", "Alf","Alf", "Alf", "Alf", "Alf", "Alf", "Alf", "Bet", "Bet", "Bet", "Bet", "Gam", "Gam", "Gam"), 
                     employersID = c("31974", "32009", "32040", "31974", "31974", "358291", "358291", "31665", "31345", "31345", "31345", "31345", "363109", "363109", "363109"), 
                     start_date = structure(c(15613, 15863, 15937, 16295, 16299, 17037, 17045, 17136, 15692, 16097, 16141, 16513, 17116, 17554, 17913), class = "Date"), 
                     end_date = structure(c(15862, 15937, 16295, 16297, 17036, 17044, 17136, NA, 16067, 16141, 16505, NA, 17543, 17907, 18272), class = "Date")), 
                row.names = c(NA,-15L), class = c("data.table", "data.frame"))

    dyadID employeesID employersID start_date   end_date
 1:      2        Alf      31974 2012-09-30 2013-06-06
 2:      3        Alf      32009 2013-06-07 2013-08-20
 3:      4        Alf      32040 2013-08-20 2014-08-13
 4:      2        Alf      31974 2014-08-13 2014-08-15
 5:      2        Alf      31974 2014-08-17 2016-08-23
 6:      5        Alf     358291 2016-08-24 2016-08-31
 7:      5        Alf     358291 2016-09-01 2016-12-01
 8:      1        Alf      31665 2016-12-01       <NA>
 9:  13765        Bet      31345 2012-12-18 2013-12-28
10:  13765        Bet      31345 2014-01-27 2014-03-12
11:  13765        Bet      31345 2014-03-12 2015-03-11
12:  13765        Bet      31345 2015-03-19       <NA>
13:  43164        Gam     363109 2016-11-11 2018-01-12
14:  43164        Gam     363109 2018-01-23 2019-01-11
15:  43164        Gam     363109 2019-01-17 2020-01-11

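Employees sign multiple contracts over time.

For example, the first row shows that Alf signed a contract with employersID==31974 on 2012-09-30 and that the contract ended on 2013-06-06. The second row shows that Alf signed a new contract with employersID==32009 on 2013-06-07.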

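Sometimes the same employee signs two consecutive contracts with the same employer (e.g. rows 4 and 5). Sometimes there are three or even four consecutive contracts (up to 9 in the real data), e.g. rows 13-15 and rows 9-12.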

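I would like to collapse the observations in which an employee signs several consecutive contracts with the same employer into a single row, so that the row reports the start_date and the end_date of the whole relationship.

The final dataset should look like this: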

    dyadID employeesID employersID start_date   end_date
 1:      2        Alf      31974 2012-09-30 2013-06-06
 2:      3        Alf      32009 2013-06-07 2013-08-20
 3:      4        Alf      32040 2013-08-20 2014-08-13
 5:      2        Alf      31974 2014-08-13 2016-08-23 # collapsed observation (one time), keeping start_date of the first collapsed observation and end_date of the last collapsed observation
 6:      5        Alf     358291 2016-08-24 2016-12-01 # collapsed one time
 7:      1        Alf      31665 2016-12-01       <NA>
 8:  13765        Bet      31345 2012-12-18       <NA> # collapsed observation (3 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation
13:  43164        Gam     363109 2016-11-11 2020-01-11 # collapsed observation (2 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation

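To do this I tried the following, but it does not look very straightforward, and it does not work when the dates need to be adjusted more than once: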

df <- setDT(df)[order(employeesID,start_date), same_dyd := ifelse(dyadID==lag(dyadID),1,0),
by=.(employeesID) # this identifies the observations I need to collapse
               ][is.na(same_dyd),same_dyd:=0
      ][order(employeesID,start_date), 
new_start_date:=if_else(same_dyd==1,lag(start_date),start_date)] # this creates a new variable with the correct date when there is only one new contract.

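But this approach is inefficient, it does not collapse the rows, and the new_start_date variable is wrong whenever more than one collapse is needed.

Does anyone have a suggestion for solving this?

Thank you very much for your help!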

【Question comments】:

Yes, I edited this

Tags: r data.table data-manipulation

【Solution 1】:

We can group by the run-length id of 'dyadID', together with 'dyadID', 'employeesID' and 'employersID', and summarise by taking the first element of 'start_date' and the last element of 'end_date':

library(data.table)
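# Collapse each run of consecutive rows that share the same dyadID:
# keep the first start_date and the last end_date of the run
df[, .(start_date = first(start_date),
       end_date   = last(end_date)),
   by = .(grp = rleid(dyadID), dyadID, employeesID, employersID)]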
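
As a minimal sketch of why this grouping works: rleid() starts a new id every time the value changes from one row to the next, so equal values that are not adjacent end up with different ids. The vector below is Alf's dyadID column from the question's data; the names dyad and run_id are illustrative only.

```r
library(data.table)

# Alf's dyadID values in row order (taken from the question's data)
dyad <- c(2, 3, 4, 2, 2, 5, 5, 1)

# rleid() increments its counter whenever a value differs from the previous one
run_id <- rleid(dyad)
run_id
# [1] 1 2 3 4 4 5 5 6
```

The first dyadID 2 (row 1) and the later pair (rows 4 and 5) receive different ids, which is why only back-to-back contracts are merged. This relies on the rows already being sorted by employee and start date, as they are in the example.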

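If we want to keep the column values from the first row of each group, create a row index with .I and use it to extract those rows, i.e. the columns of the original data that are not in the summary: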

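# rn stores the original row number of the first row in each run
out <- df[, .(start_date = first(start_date),
              end_date   = last(end_date),
              rn         = .I[1]),
          by = .(grp = rleid(dyadID), dyadID, employeesID, employersID)]
# Bind on the columns of df that the summary dropped, taken from those first rows
cbind(out, df[out$rn, setdiff(names(df), names(out)), with = FALSE])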
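
With the question's df, setdiff() returns no extra columns, so the effect is easier to see on a small example. The sketch below uses a toy table with a hypothetical extra column, sector, assumed to be constant within a dyad; neither toy nor sector comes from the original data.

```r
library(data.table)

# Toy data; `sector` is a hypothetical extra column, constant within a dyad
toy <- data.table(
  dyadID      = c(2, 2, 5),
  employeesID = c("Alf", "Alf", "Alf"),
  employersID = c("31974", "31974", "358291"),
  start_date  = as.Date(c("2014-08-13", "2014-08-17", "2016-08-24")),
  end_date    = as.Date(c("2014-08-15", "2016-08-23", "2016-08-31")),
  sector      = c("retail", "retail", "transport")
)

out <- toy[, .(start_date = first(start_date),
               end_date   = last(end_date),
               rn         = .I[1]),          # row number of each run's first row
           by = .(grp = rleid(dyadID), dyadID, employeesID, employersID)]

# Columns of `toy` that the summary dropped (here only `sector`)
extra <- setdiff(names(toy), names(out))
res <- cbind(out, toy[out$rn, extra, with = FALSE])
res
```

Here the two rows for dyadID 2 collapse into one that keeps sector from the first of them. If the extra columns can differ within a run, the choice of which row to keep has to be made explicitly.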

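【Comments】:

Wow!!!! This is very concise and efficient!!! Thanks @akrun! I would never have dreamed of something like this! Would you mind elaborating on the logic behind it?
@Alex There is one row that does not match. Is it because of shift(start_date, type = 'lead') == end_date
Do you mean row 7 of the original df? You are right, I failed to collapse it by mistake. Your output is correct. I would also like to know whether there is a way to use the same approach but also keep other variables in df (which should be identical within the same dyad)?
@Alex It can be done, but I have a question here: which row do you want to keep? For start_date we keep the first element, and for end_date the last one
Works great!! Thank you very much!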