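[Posted]: 2019-09-12 23:16:20
[Question]:
I need to identify instances where two events both occur within a specific time window, as follows. If event A occurs first, event B must occur within 24 hours. If, on the other hand, B occurs first, A must occur within 72 hours. In addition, when the condition is met, I need the "start" time, that is, the time at which the first of the two events occurred.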
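To make the rule concrete, here is a minimal sketch of the check for a single A/B pair. The helper name `pair_start` is hypothetical, and two details are assumptions rather than part of the question: the window boundaries are treated as inclusive, and simultaneous events count as "A first".

```r
# Hypothetical helper: returns the start time of a qualifying A/B pair,
# or NA when the pair falls outside both windows.
pair_start <- function(a_time, b_time) {
  # Positive gap means A happened before B (assumption: a gap of 0 counts as A first)
  gap_hours <- as.numeric(difftime(b_time, a_time, units = "hours"))
  if (gap_hours >= 0 && gap_hours <= 24) return(a_time)   # A first, B within 24 h
  if (gap_hours < 0 && -gap_hours <= 72) return(b_time)   # B first, A within 72 h
  as.POSIXct(NA)                                          # no qualifying pair
}

a1 <- as.POSIXct("2018-01-05 09:00:00", tz = "UTC")
b1 <- as.POSIXct("2018-01-05 22:00:00", tz = "UTC")
pair_start(a1, b1)  # A first, B 13 h later: the start is a1
```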
Event A (referred to as event_a in the code below)
structure(list(fake_id = c("1000686267", "1000686267", "1000686267",
"1000686267", "1000686267", "1000686267", "1000686267", "1070640921",
"1070640921", "1070640921", "1070640921", "1070640921", "1070640921",
"1184695414", "1184695414", "1184695414", "1184695414", "1184695414"
), date = structure(c(1515063600, 1514822400, 1514822400, 1514822400,
1514822400, 1515146400, 1514901600, 1515330000, 1514822400, 1514822400,
1514822400, 1514822400, 1517385600, 1516701600, 1515142800, 1515178800,
1515178800, 1516557600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-18L), class = "data.frame", .Names = c("fake_id",
"date"))
Event B (referred to as event_b in the code below)
structure(list(fake_id = c("1000686267", "1000686267", "1000686267",
"1000686267", "1000686267", "1000686267", "1000686267", "1000686267",
"1000686267", "1000686267", "1000686267", "1000686267", "1000686267",
"1000686267", "1000686267", "1000686267", "1000686267", "1070640921",
"1070640921", "1070640921", "1070640921", "1070640921", "1070640921",
"1184695414", "1184695414", "1184695414", "1184695414", "1184695414",
"1184695414", "1184695414"), date = structure(c(1516795200, 1516795200,
1516795200, 1516917600, 1517400000, 1517400000, 1515492000, 1515492000,
1516190400, 1516190400, 1517410800, 1517410800, 1516921200, 1515070800,
1515070800, 1515052800, 1516633200, 1517374800, 1515322800, 1515322800,
1516525200, 1515232800, 1516543200, 1516550400, 1515189600, 1516543200,
1516543200, 1515142800, 1515142800, 1515142800), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -30L), class = "data.frame", .Names = c("fake_id",
"date"))
Some code
library(data.table)
# Convert to data.table first (`with = FALSE` is not valid on a plain data.frame)
event_a <- as.data.table(event_a)[, c("fake_id", "date"), with = FALSE]
event_b <- as.data.table(event_b)[, c("fake_id", "date"), with = FALSE]
# Marker columns that show which table a joined row matched
event_a[, criteria_a := "criteria_a"]
event_b[, criteria_b := "criteria_b"]
setkeyv(event_a, c("fake_id", "date"))
setkeyv(event_b, c("fake_id", "date"))
# Rolling-join windows in seconds: 24 h (A then B) and 72 h (B then A)
join_window <- 60 * 60 * c(24, 72)
event_subset_a <- event_a[event_b, roll = join_window[1]]
event_subset_b <- event_b[event_a, roll = join_window[2]]
# The two results hold the marker columns in different order, so bind by name
event_df <- rbind(event_subset_a, event_subset_b, use.names = TRUE)
event_df[, c("criteria_a", "criteria_b") := NULL]
setkeyv(event_df, c("fake_id", "date"))
event_df <- unique(event_df)
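One thing worth checking in the attempt above: in a rolling join `x[i, roll = ...]`, the join column of the result carries the value from `i`, which here is the later event rather than the first one. Unmatched rows of `i` are also kept, with `NA` in the columns from `x`, unless `nomatch` is set. The sketch below illustrates this on made-up toy data; the `id` value and the copied column `a_date` are assumptions for illustration only.

```r
library(data.table)

# Toy data (hypothetical): one A event and two B events for a single id
a <- data.table(id = "x",
                date = as.POSIXct("2018-01-05 09:00:00", tz = "UTC"))
b <- data.table(id = "x",
                date = as.POSIXct(c("2018-01-05 22:00:00",
                                    "2018-01-21 14:00:00"), tz = "UTC"))

# Keep A's own timestamp in a separate column, because the join column
# `date` in the result is taken from `b`
a[, a_date := date]
setkey(a, id, date)
setkey(b, id, date)

# For each B, roll the most recent A forward by at most 24 hours
res <- a[b, roll = 24 * 3600]
print(res)
# `date` holds the B times; `a_date` holds the matching A time, or NA
# when no A occurred in the 24 hours before that B
```

Adding `nomatch = NULL` to the join would drop the unmatched B rows, and `a_date` would then be the start time for the "A first" case.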
Current output
fake_id date
1 1184695414 2018-01-05 09:00:00
2 1184695414 2018-01-05 19:00:00
3 1184695414 2018-01-05 22:00:00
4 1184695414 2018-01-21 14:00:00
5 1184695414 2018-01-21 16:00:00
6 1184695414 2018-01-21 18:00:00
7 1184695414 2018-01-23 10:00:00
Desired output
fake_id date
1 1184695414 2018-01-05 09:00:00
2 1184695414 2018-01-21 14:00:00
3 1184695414 2018-01-23 10:00:00
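[Comments]:
- Why are you mixing data.table and dplyr syntax?
- It would be useful if you could show what your expected output should look like.
- I'm not at my computer right now, but I recently answered a similar question. You may want to take a look and see whether you can apply some of the functions, in particular data.table::foverlaps and non-equi joins. stackoverflow.com/questions/57876463/…
- @MauritsEvers, sorry! As a big dplyr fan, it's force of habit.
- @RonakShah, hope this helps. Thanks!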
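The non-equi join mentioned in the comments could look roughly like the sketch below. It is not taken from the linked answer; it runs on made-up toy data, and the column names (`a_time`, `b_time`, `lo`, `hi`, `start`) are assumptions. Window boundaries are treated as inclusive. It has not been checked against the desired output above.

```r
library(data.table)

# Toy data (hypothetical): two A events and two B events for one id
a <- data.table(id = "x",
                a_time = as.POSIXct(c("2018-01-05 09:00:00",
                                      "2018-01-23 10:00:00"), tz = "UTC"))
b <- data.table(id = "x",
                b_time = as.POSIXct(c("2018-01-05 22:00:00",
                                      "2018-01-21 14:00:00"), tz = "UTC"))

# Case 1: A first, B within 24 h. Each A defines a window [a_time, a_time + 24 h]
a_win <- a[, .(id, a_time, lo = a_time, hi = a_time + 24 * 3600)]
a_first <- b[a_win, on = .(id, b_time >= lo, b_time <= hi),
             nomatch = NULL, .(id, start = i.a_time)]

# Case 2: B first, A within 72 h. Each B defines a window [b_time, b_time + 72 h]
b_win <- b[, .(id, b_time, lo = b_time, hi = b_time + 72 * 3600)]
b_first <- a[b_win, on = .(id, a_time >= lo, a_time <= hi),
             nomatch = NULL, .(id, start = i.b_time)]

# Combine both cases and keep each start time once
starts <- unique(rbind(a_first, b_first))[order(id, start)]
print(starts)
```

On this toy data the result has two rows: 2018-01-05 09:00 (A first, B 13 hours later) and 2018-01-21 14:00 (B first, A 44 hours later). Because each window is built from the first event, `start` is always the time of the earlier event in the pair.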
Tags: r join data.table