【Question title】: How can I filter for most recent occurrence within a time window?
【Posted】: 2020-10-29 19:38:09
【Question】:

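I have a data stream containing a time, an ID, two event types (A and B), and a (currently) blank co-occurrence column. I want to go through the dataset and, for each B event, check whether there is an A in the preceding 5 seconds. If so, that A event's row receives the B event's ID in its co-occurrence column. In the rare case where there is more than one, the second co-occurrence goes into a second column (or both can go into the same column to be dealt with later).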

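I can get most of the desired result with a loop and some logic, but sometimes several Bs fall within 5 seconds of an A, or several As fall within the 5 seconds before a B, so checking only the previous row (current row - 1) does not catch these.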

The example data stream looks like this:

Time     ID  Event Co1 Co2
7:47:28  X1  A
7:47:30  X2  B
7:48:02  X3  A
7:48:04  X4  A
7:48:05  X5  B
7:50:11  X1  A
7:50:12  X2  B
7:50:15  X5  B
7:55:50  X6  A
7:55:52  X2  B

With the correct processing, this should produce:

Time     ID  Event Co1 Co2
7:47:28  X1  A     X2
7:47:30  X2  B
7:48:02  X3  A     X5
7:48:04  X4  A     X5
7:48:05  X5  B
7:50:11  X1  A     X2  X5
7:50:12  X2  B
7:50:15  X5  B
7:55:50  X6  A     X2
7:55:52  X2  B

Any help or pointers in the right direction would be greatly appreciated!

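【Question comments】:

  • Are your IDs supposed to be unique? Otherwise it is not clear what putting them into col1 and col2 is for.
  • The IDs are unique, but for this part of the processing it does not matter whether an ID goes into Co1 or Co2, or whether they go together into a single column, e.g. "X2X5". I can split and move them at a later stage.

Tags: r database filtering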


【Solution 1】:

Given your input:

df <- read.table(text = "Time     ID  Event
7:47:28  X1  A
7:47:30  X2  B
7:48:02  X3  A
7:48:04  X4  A
7:48:05  X5  B
7:50:11  X1  A
7:50:12  X2  B
7:50:15  X5  B
7:55:50  X6  A
7:55:52  X2  B", header = TRUE)

# convert to HMS
df$Time <- lubridate::hms(df$Time)

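You can use slide_index_dfr to capture the IDs of the B events up to 5 seconds ahead of each A and return them as a data frame. Then you can rename the columns and bind them back onto your df.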

xx <- slider::slide_index_dfr(df, df$Time, ~if(.$Event[1] == "A") .$ID[.$Event == "B"] else character(), .after = 5)
colnames(xx) <- paste0("Col", seq_len(ncol(xx)))
cbind(df, xx)
#>          Time ID Event Col1 Col2
#> 1  7H 47M 28S X1     A   X2 <NA>
#> 2  7H 47M 30S X2     B <NA> <NA>
#> 3   7H 48M 2S X3     A   X5 <NA>
#> 4   7H 48M 4S X4     A   X5 <NA>
#> 5   7H 48M 5S X5     B <NA> <NA>
#> 6  7H 50M 11S X1     A   X2   X5
#> 7  7H 50M 12S X2     B <NA> <NA>
#> 8  7H 50M 15S X5     B <NA> <NA>
#> 9  7H 55M 50S X6     A   X2 <NA>
#> 10 7H 55M 52S X2     B <NA> <NA>
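
The same window logic can also be spelled out in base R, which may help if slider is not available or if you want to see exactly what is being matched. This is only a sketch and not part of the original answer: the helper to_secs and the names secs, hits, co and res are made up for illustration, and the window is assumed to be inclusive at both ends (a B counts if it occurs 0 to 5 seconds after the A).

```r
df <- read.table(text = "Time     ID  Event
7:47:28  X1  A
7:47:30  X2  B
7:48:02  X3  A
7:48:04  X4  A
7:48:05  X5  B
7:50:11  X1  A
7:50:12  X2  B
7:50:15  X5  B
7:55:50  X6  A
7:55:52  X2  B", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: turn "H:M:S" strings into seconds since midnight
to_secs <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  storage.mode(parts) <- "numeric"
  as.vector(parts %*% c(3600, 60, 1))
}
secs <- to_secs(df$Time)

# For each A row, collect the IDs of B rows that occur 0 to 5 seconds later;
# B rows get an empty result
hits <- lapply(seq_len(nrow(df)), function(i) {
  if (df$Event[i] != "A") return(character())
  gap <- secs - secs[i]
  df$ID[df$Event == "B" & gap >= 0 & gap <= 5]
})

# Spread the matches over as many Co columns as the busiest A row needs
width <- max(lengths(hits))
co <- matrix(NA_character_, nrow = nrow(df), ncol = width,
             dimnames = list(NULL, paste0("Co", seq_len(width))))
for (i in seq_along(hits)) {
  if (length(hits[[i]]) > 0) co[i, seq_along(hits[[i]])] <- hits[[i]]
}

res <- data.frame(df, co, stringsAsFactors = FALSE)
res
```

Each A row is compared against every row, so this is quadratic in the number of rows. That is fine for small data; for long streams a rolling or interval join is likely the better fit.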

【Comments】:

    【Solution 2】:

    Here is a solution using the foverlaps function from the data.table package:

    library(data.table)
    dt <- read.table(text = "Time ID Event
    07:47:28 X1 A
    07:47:30 X2 B
    07:48:02 X3 A
    07:48:04 X4 A
    07:48:05 X5 B
    07:50:11 X6 A
    07:50:12 X7 B
    07:50:15 X8 B
    07:55:50 X9 A
    07:55:52 X10 B", header = TRUE, sep = " ", stringsAsFactors = FALSE)
    
    
    # Use data.table
    setDT(dt)
    
    
    # Join dataset to self over the 5 second lookback period
    dt[, time := as.ITime(Time)]
    dt[, time.lookback := time - as.ITime("00:00:05")]
    setkey(dt, time.lookback, time)
    dt.join <- foverlaps(dt, dt)
    dt.join <- dt.join[order(ID)]
    
    # You should be able to simplify this part a lot:
    dt.join <- dt.join[(Event == i.Event & time == i.time) | (Event == "A" & i.Event == "B" & time < i.time)]
    setorder(dt.join, ID, Event, -i.Event, i.time)
    dt.join[i.Event == "A", i.ID := NA]
    dt.join[i.Event == "A", i.Event := NA]
    dt.join[i.Event == "B" & time == i.time, i.ID := NA]
    dt.join[i.Event == "B" & time == i.time, i.Event := NA]
    dt.join[, rn := cumsum(i.Event == "B"), .(ID, Event)]
    
    # Now bringing the dataset back to original granularity:
    res <- dcast(
      dt.join, 
      formula = ID + Event ~ paste0("col", rn), 
      value.var = "i.ID"
    )
    res$colNA <- NULL
    res
    #     ID Event col1 col2
    # 1:  X1     A   X2 <NA>
    # 2: X10     B <NA> <NA>
    # 3:  X2     B <NA> <NA>
    # 4:  X3     A   X5 <NA>
    # 5:  X4     A   X5 <NA>
    # 6:  X5     B <NA> <NA>
    # 7:  X6     A   X7   X8
    # 8:  X7     B <NA> <NA>
    # 9:  X8     B <NA> <NA>
    # 10:  X9     A  X10 <NA>
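
One possible way to shorten the filtering step above is a non-equi join, which data.table supports through the on argument. The following is a sketch under assumptions, not the answerer's code: the names secs, a, b, pairs, rn and res are invented for illustration, the data reuse this answer's renumbered IDs (X1 to X10), and unlike the foverlaps version the result lists only the A rows.

```r
library(data.table)

dt <- data.table(
  Time  = c("07:47:28", "07:47:30", "07:48:02", "07:48:04", "07:48:05",
            "07:50:11", "07:50:12", "07:50:15", "07:55:50", "07:55:52"),
  ID    = paste0("X", 1:10),
  Event = c("A", "B", "A", "A", "B", "A", "B", "B", "A", "B")
)

# Seconds since midnight, so the 5 second window is plain integer arithmetic
dt[, secs := as.integer(as.ITime(Time))]

# A rows carry their window [start, end]; B rows carry a single time point
a <- dt[Event == "A", .(ID, start = secs, end = secs + 5L)]
b <- dt[Event == "B", .(b_id = ID, b_secs = secs)]

# Non-equi join: for every A row, all B rows with start <= b_secs <= end.
# A rows without a match are kept with b_id = NA.
pairs <- b[a, on = .(b_secs >= start, b_secs <= end),
           .(ID = i.ID, b_id = x.b_id)]

# Number the matches per A row and spread them into col1, col2, ...
pairs[, rn := paste0("col", rowid(ID))]
res <- dcast(pairs, ID ~ rn, value.var = "b_id")
res
```

To get back to one row per original event, res could be merged onto dt by ID. That only works cleanly when IDs are unique, as they are in this answer's sample data.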
    

    【Comments】:
