【Question title】:data.table R: instants in which two events occur within a specific time epoch
【Posted】:2019-09-12 23:16:20
【Question】:

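I need to identify instances in which two events occur within a specific time period, as follows. If event A occurs first, event B must occur within 24 hours. On the other hand, if B occurs first, A must be found within 72 hours. In addition, when the condition is met, I need the "onset" time, i.e. the time at which the first of the two events occurred.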
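To make the rule concrete, here is a minimal base R sketch for a single pair of timestamps. The helper name `classify_pair` is hypothetical, and two details are assumptions the question does not settle: the window boundaries are treated as inclusive, and two events with the same timestamp are treated as no match.

```r
# Hypothetical helper (not from the original post): classify one A/B pair.
# a, b: POSIXct timestamps of event A and event B for the same fake_id.
classify_pair <- function(a, b) {
  # positive gap means A happened before B
  gap_hours <- as.numeric(difftime(b, a, units = "hours"))
  if (gap_hours > 0 && gap_hours <= 24) {
    list(type = "A", onset = a)   # A first, B within 24 hours (assumed inclusive)
  } else if (gap_hours < 0 && -gap_hours <= 72) {
    list(type = "B", onset = b)   # B first, A within 72 hours (assumed inclusive)
  } else {
    NULL                          # too far apart, or simultaneous (assumed no match)
  }
}

a <- as.POSIXct("2018-01-05 09:00:00", tz = "UTC")
b <- as.POSIXct("2018-01-05 22:00:00", tz = "UTC")
res <- classify_pair(a, b)
```

Applying this pair by pair would be slow on real data; it is only meant to pin down what a join-based solution has to reproduce.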

Event A

structure(list(fake_id = c("1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1070640921", 
"1070640921", "1070640921", "1070640921", "1070640921", "1070640921", 
"1184695414", "1184695414", "1184695414", "1184695414", "1184695414"
), date = structure(c(1515063600, 1514822400, 1514822400, 1514822400, 
1514822400, 1515146400, 1514901600, 1515330000, 1514822400, 1514822400, 
1514822400, 1514822400, 1517385600, 1516701600, 1515142800, 1515178800, 
1515178800, 1516557600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-18L), class = "data.frame", .Names = c("fake_id", 
"date"))

Event B

structure(list(fake_id = c("1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1070640921", 
"1070640921", "1070640921", "1070640921", "1070640921", "1070640921", 
"1184695414", "1184695414", "1184695414", "1184695414", "1184695414", 
"1184695414", "1184695414"), date = structure(c(1516795200, 1516795200, 
1516795200, 1516917600, 1517400000, 1517400000, 1515492000, 1515492000, 
1516190400, 1516190400, 1517410800, 1517410800, 1516921200, 1515070800, 
1515070800, 1515052800, 1516633200, 1517374800, 1515322800, 1515322800, 
1516525200, 1515232800, 1516543200, 1516550400, 1515189600, 1516543200, 
1516543200, 1515142800, 1515142800, 1515142800), class = c("POSIXct", 
"POSIXt"), tzone = "UTC")), row.names = c(NA, -30L), class = "data.frame", .Names = c("fake_id", 
"date"))

Some code


 library(data.table)

 # keep only the id and timestamp columns (convert first, since `with` is a data.table argument)
 event_a <- as.data.table(event_a)[, c("fake_id", "date"), with = FALSE]
 event_b <- as.data.table(event_b)[, c("fake_id", "date"), with = FALSE]

 # marker columns, so a match is visible after each join
 event_a[, `:=`("criteria_a", "criteria_a")]
 event_b[, `:=`("criteria_b", "criteria_b")]

 setkeyv(event_a, c("fake_id", "date"))
 setkeyv(event_b, c("fake_id", "date"))

 # rolling-join windows in seconds: 24 hours and 72 hours
 join_window <- 60 * 60 * c(24, 72)

 event_subset_a <- event_a[event_b, roll = join_window[1]]
 event_subset_b <- event_b[event_a, roll = join_window[2]]

 event_df <- rbind(event_subset_a, event_subset_b)
 event_df[, `:=`(c("criteria_a", "criteria_b"),  NULL)]

 setkeyv(event_df, c("fake_id", "date"))
 event_df <- unique(event_df)

Current output

      fake_id                date
1  1184695414 2018-01-05 09:00:00
2  1184695414 2018-01-05 19:00:00
3  1184695414 2018-01-05 22:00:00
4  1184695414 2018-01-21 14:00:00
5  1184695414 2018-01-21 16:00:00
6  1184695414 2018-01-21 18:00:00
7  1184695414 2018-01-23 10:00:00

Desired output

      fake_id                date
1  1184695414 2018-01-05 09:00:00
2  1184695414 2018-01-21 14:00:00
3  1184695414 2018-01-23 10:00:00

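【Comments】:

  • Why are you mixing data.table and dplyr syntax?
  • It would be useful if you could show what your expected output looks like.
  • I'm not at my computer right now, but I recently answered a similar question. You may want to look at it and see whether some of its functions apply, in particular data.table::foverlaps and non-equi joins. stackoverflow.com/questions/57876463/…
  • @MauritsEvers, sorry! As a big dplyr fan, it's force of habit.
  • @RonakShah, hope this helps. Thanks!

Tags: r join data.table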


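【Solution 1】:

At first I thought this problem required a non-equi join, but then I realized that a standard join is enough.

The overall procedure is:

  1. Eliminate duplicate rows.
  2. Join the two tables.
  3. Filter the rows where event A occurred first. Mark them as "type A" and set the onset time.
  4. Filter the rows where event B occurred first. Mark them as "type B" and set the onset time.
  5. Drop the rows that were not marked.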

library(data.table)
library(lubridate)  # we'll use the dhours() function

setDT(eventA, key = "fake_id")
setDT(eventB, key = "fake_id")

Rename the date columns so that it is clear which table each one came from

setnames(eventA, "date", "dateA")
setnames(eventB, "date", "dateB")

Eliminate duplicate rows

eventA <- eventA[!duplicated(eventA), ]
eventB <- eventB[!duplicated(eventB), ]

Join the two tables and carry out steps 2 - 4 of the overall plan by chaining

eventA[eventB, 
       allow.cartesian = TRUE][
          dateA < dateB & dateB <= dateA + dhours(24), 
          `:=` (type = "A", 
                onset = dateA)][
                    dateB < dateA & dateA <= dateB + dhours(72), 
                    `:=` (type = "B", 
                          onset = dateB)][!is.na(type), ][]

       fake_id               dateA               dateB type               onset
 1: 1000686267 2018-01-04 11:00:00 2018-01-04 08:00:00    B 2018-01-04 08:00:00
 2: 1000686267 2018-01-05 10:00:00 2018-01-04 08:00:00    B 2018-01-04 08:00:00
 3: 1000686267 2018-01-04 11:00:00 2018-01-04 13:00:00    A 2018-01-04 11:00:00
 4: 1000686267 2018-01-05 10:00:00 2018-01-04 13:00:00    B 2018-01-04 13:00:00
 5: 1070640921 2018-01-07 13:00:00 2018-01-06 10:00:00    B 2018-01-06 10:00:00
 6: 1070640921 2018-01-07 13:00:00 2018-01-07 11:00:00    B 2018-01-07 11:00:00
 7: 1070640921 2018-01-31 08:00:00 2018-01-31 05:00:00    B 2018-01-31 05:00:00
 8: 1184695414 2018-01-05 19:00:00 2018-01-05 09:00:00    B 2018-01-05 09:00:00
 9: 1184695414 2018-01-05 09:00:00 2018-01-05 22:00:00    A 2018-01-05 09:00:00
10: 1184695414 2018-01-05 19:00:00 2018-01-05 22:00:00    A 2018-01-05 19:00:00
11: 1184695414 2018-01-21 18:00:00 2018-01-21 14:00:00    B 2018-01-21 14:00:00
12: 1184695414 2018-01-23 10:00:00 2018-01-21 14:00:00    B 2018-01-21 14:00:00
13: 1184695414 2018-01-21 18:00:00 2018-01-21 16:00:00    B 2018-01-21 16:00:00
14: 1184695414 2018-01-23 10:00:00 2018-01-21 16:00:00    B 2018-01-21 16:00:00

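This output is quite different from your expected output. However, going by your data and the rules you stated (if A precedes B and B is within 24 h of A, then A; if B precedes A and A is within 72 h of B, then B), there are 11 matches in addition to the ones you found. In other words: either your expected output is wrong, or your stated rules are wrong.

【Comments】: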
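The question asks for one row per id and onset time rather than one row per matching pair. A minimal sketch of that last step, assuming the chained result has been stored in a table; the name `matches` is hypothetical, and the toy rows are copied from rows 8 to 10 of the output above.

```r
library(data.table)

# Toy stand-in for the joined result above (rows 8 to 10);
# `matches` is a hypothetical name, not one used in the answer.
matches <- data.table(
  fake_id = "1184695414",
  type    = c("B", "A", "A"),
  onset   = as.POSIXct(c("2018-01-05 09:00:00",
                         "2018-01-05 09:00:00",
                         "2018-01-05 19:00:00"), tz = "UTC")
)

# Keep one row per id and onset time, sorted chronologically.
onsets <- unique(matches[, .(fake_id, onset)])
setorder(onsets, fake_id, onset)
onsets
```

Whether several overlapping pairs should be collapsed further into a single episode is a separate decision that the stated rules do not cover.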

    【Solution 2】:

    This is similar to @PavoDive's answer, but it focuses on setting up the non-equi join conditions before the actual join:

    library(data.table)
    setDT(event_a)
    setDT(event_b)
    
    # for the join: event B must fall between 72 hours before and 24 hours after event A
    event_a[, `:=`(min_date = date - 72*60*60,
                   max_date = date + 24*60*60)]
    
    # join the de-duplicated data.tables
    unique(event_b)[unique(event_a),
                    # non-equi join conditions
                    on = .(fake_id = fake_id,
                           date > min_date,
                           date < max_date),
                    nomatch = 0L,
                    allow.cartesian = TRUE,
                    # select columns; for the desired output keep only fake_id and onset
                    j = .(fake_id,
                          a_date = i.date,
                          b_date = x.date,
                          onset = pmin(i.date, x.date),
                          first_type = ifelse(i.date == x.date,
                                              NA_character_,
                                              ifelse(i.date < x.date,
                                                     'A',
                                                     'B'))
                          )
                    ]
    
           fake_id              a_date              b_date               onset first_type
     1: 1000686267 2018-01-04 11:00:00 2018-01-04 13:00:00 2018-01-04 11:00:00          A
     2: 1000686267 2018-01-04 11:00:00 2018-01-04 08:00:00 2018-01-04 08:00:00          B
     3: 1000686267 2018-01-05 10:00:00 2018-01-04 13:00:00 2018-01-04 13:00:00          B
     4: 1000686267 2018-01-05 10:00:00 2018-01-04 08:00:00 2018-01-04 08:00:00          B
     5: 1070640921 2018-01-07 13:00:00 2018-01-07 11:00:00 2018-01-07 11:00:00          B
     6: 1070640921 2018-01-07 13:00:00 2018-01-06 10:00:00 2018-01-06 10:00:00          B
     7: 1070640921 2018-01-31 08:00:00 2018-01-31 05:00:00 2018-01-31 05:00:00          B
     8: 1184695414 2018-01-23 10:00:00 2018-01-21 16:00:00 2018-01-21 16:00:00          B
     9: 1184695414 2018-01-23 10:00:00 2018-01-21 14:00:00 2018-01-21 14:00:00          B
    10: 1184695414 2018-01-05 09:00:00 2018-01-05 22:00:00 2018-01-05 09:00:00          A
    11: 1184695414 2018-01-05 09:00:00 2018-01-05 09:00:00 2018-01-05 09:00:00       <NA>
    12: 1184695414 2018-01-05 19:00:00 2018-01-05 22:00:00 2018-01-05 19:00:00          A
    13: 1184695414 2018-01-05 19:00:00 2018-01-05 09:00:00 2018-01-05 09:00:00          B
    14: 1184695414 2018-01-21 18:00:00 2018-01-21 16:00:00 2018-01-21 16:00:00          B
    15: 1184695414 2018-01-21 18:00:00 2018-01-21 14:00:00 2018-01-21 14:00:00          B
    

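    The difference in the output is row 11, where both events occur at the same time. My join conditions do not exclude it, because data.table does not currently support a not-equal (`!=`) operator in join conditions.

    【Comments】: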
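If simultaneous events should not count as a match, one option is to filter them out after the join instead. A minimal sketch, assuming the join result has been stored in a table; the name `joined` is hypothetical, and the toy rows are copied from rows 10 and 11 of the output above.

```r
library(data.table)

# Toy stand-in for the join result (rows 10 and 11 above);
# `joined` is a hypothetical name, not one used in the answer.
joined <- data.table(
  fake_id = "1184695414",
  a_date  = as.POSIXct(c("2018-01-05 09:00:00", "2018-01-05 09:00:00"), tz = "UTC"),
  b_date  = as.POSIXct(c("2018-01-05 22:00:00", "2018-01-05 09:00:00"), tz = "UTC")
)

# Drop pairs where A and B share a timestamp; `!=` works in an ordinary row filter.
no_ties <- joined[a_date != b_date]
no_ties
```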
