[Question title]: Find nearest preceding and following dates between data frames
[Posted]: 2018-10-13 19:47:33
[Question]:

I have the following two data frames:

df1 <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
             Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")))

df2 <- data.frame(ID = c("A","A","A","B","C","D","D","D","D","D","E"),
              Date = as.POSIXct(c("2018-04-10 07:11:00","2018-04-11 18:59:00","2018-04-12 12:37:00","2018-04-15 01:43:00","2018-04-21 09:52:00","2018-04-15 20:25:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00","2018-04-20 14:11:00","2018-05-01 09:50:00")))

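For df1, I would like to do two things. First, I want to find the nearest preceding date from df2, by ID. Second, I want to find the nearest following date from df2, again by ID. In both cases, a date from df2 should not be used more than once in df1.

Using the rolling join (roll = Inf) from the data.table package, I can merge the preceding dates by ID.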

library(data.table)

setDT(df1)
setDT(df2)

setkey(df1, ID, Date)
setkey(df2, ID, Date)[, PrecedingDate:=Date]

result <- df2[df1, roll=Inf]

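I am not sure how to pull the nearest following date from df2 into df1, or how to make sure the dates are not duplicated.

The result should look like this: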
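
For the "following date" half of the question, one option is to reverse the direction of the rolling join. The sketch below is only an illustration on a cut-down copy of the data for ID A: the names d1, d2 and MatchDate are hypothetical, tz = "UTC" is assumed so the printed times do not depend on the local time zone, and it does not address the "no duplicates" requirement. Note also that a rolling join treats an exact match as a match in both directions, so a tied date would show up in both columns.

```r
library(data.table)

# Cut-down illustrative tables (ID "A" only; names are hypothetical)
d1 <- data.table(
  ID   = c("A", "A"),
  Date = as.POSIXct(c("2018-04-12 08:56:00", "2018-04-13 11:03:00"), tz = "UTC")
)
d2 <- data.table(
  ID   = c("A", "A", "A"),
  Date = as.POSIXct(c("2018-04-10 07:11:00", "2018-04-11 18:59:00",
                      "2018-04-12 12:37:00"), tz = "UTC")
)

# Keep a copy of the d2 date: after the join, Date holds d1's values
d2[, MatchDate := Date]

# roll = Inf rolls the last earlier d2 row forward (nearest preceding)
prec <- d2[d1, on = .(ID, Date), roll = Inf]
# roll = -Inf rolls the next later d2 row backward (nearest following)
foll <- d2[d1, on = .(ID, Date), roll = -Inf]

# Both joins return one row per d1 row, in d1's order
d1[, PrecedingDate := prec$MatchDate]
d1[, FollowingDate := foll$MatchDate]
print(d1)
```

Rows of d1 with no later date in d2 for the same ID get NA in FollowingDate, and likewise for PrecedingDate when there is no earlier date.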

result <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
                     Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")),
                     PrecedingDate = as.POSIXct(c("2018-04-11 18:59:00","2018-04-12 02:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-15 20:25:00","2018-04-17 14:21:00",NA,"2018-05-01 09:50:00")),
                     FollowingDate = as.POSIXct(c("2018-04-12 02:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-21 09:52:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00",NA)))

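Any help here would be greatly appreciated.

[Comments]:

  • What happens if df2 and df1 have the same date? Is it classified as preceding, as following, or ignored?
  • In those cases, it should be classified as preceding only.
  • The second PrecedingDate and the first FollowingDate in result are incorrect imo. They should both be 2018-04-12 12:37:00. I have corrected this in my answer.

Tags: r data.table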


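[Solution 1]:

Here is a solution using dplyr. You may get some warnings from the min and max functions (they are raised when an ID has no earlier or no later date), but you can safely ignore or suppress them.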

library(dplyr)

closest_to_zero <- function(x) {
  neg <- which(x == max(x[x < 0]))
  pos <- which(x == min(x[x > 0]))
  c(previous = neg, following = pos)
}

result <- left_join(df1, df2, by = "ID") %>%
  group_by(ID, Date.x) %>%
  mutate(
    time_diff = Date.y - Date.x,
    Preceding = Date.y[closest_to_zero(time_diff)["previous"]],
    Following = Date.y[closest_to_zero(time_diff)["following"]]
  ) %>%
  distinct(ID, Date.x, Preceding, Following)

# A tibble: 9 x 4
# Groups:   ID, Date.x [9]
  ID    Date.x              Preceding           Following          
  <fct> <dttm>              <dttm>              <dttm>             
1 A     2018-04-12 08:56:00 2018-04-11 18:59:00 2018-04-12 12:37:00
2 A     2018-04-13 11:03:00 2018-04-12 12:37:00 NA                 
3 B     2018-04-14 14:30:00 NA                  2018-04-15 01:43:00
4 B     2018-04-15 03:10:00 2018-04-15 01:43:00 NA                 
5 C     2018-04-16 07:28:00 NA                  2018-04-21 09:52:00
6 D     2018-04-17 11:17:00 2018-04-15 20:25:00 2018-04-17 12:33:00
7 D     2018-04-17 14:21:00 2018-04-17 12:33:00 2018-04-18 10:59:00
8 D     2018-04-18 09:56:00 2018-04-17 14:21:00 2018-04-18 10:59:00
9 E     2018-05-02 07:49:00 2018-05-01 09:50:00 NA                 
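
The warnings mentioned above come from calling max() or min() on an empty vector. As a rough sketch of how they could be avoided, the base R helper below checks for an empty set before taking the extreme. The function name nearest_around and the use of tz = "UTC" are assumptions for illustration, not part of the answer. It keeps the answer's strict comparisons, so an exact match counts as neither preceding nor following; changing `<` to `<=` would treat ties as preceding, which is what the asker requested in the comments.

```r
# Hypothetical helper: nearest earlier and nearest later candidate for one
# target time. Returns NA (of the same date-time class) when none exists.
nearest_around <- function(target, candidates) {
  before <- candidates[candidates < target]   # strictly earlier
  after  <- candidates[candidates > target]   # strictly later
  na_time <- candidates[NA_integer_]          # NA that keeps class and tz
  list(
    preceding = if (length(before) > 0) max(before) else na_time,
    following = if (length(after) > 0) min(after) else na_time
  )
}

# df2 dates for ID "D" from the question (tz = "UTC" is an assumption)
cand_d <- as.POSIXct(c("2018-04-15 20:25:00", "2018-04-17 12:33:00",
                       "2018-04-17 14:21:00", "2018-04-18 10:59:00",
                       "2018-04-20 14:11:00"), tz = "UTC")
res_d <- nearest_around(as.POSIXct("2018-04-17 11:17:00", tz = "UTC"), cand_d)

# df2 has a single date for ID "B", and it is later than the target
cand_b <- as.POSIXct("2018-04-15 01:43:00", tz = "UTC")
res_b <- nearest_around(as.POSIXct("2018-04-14 14:30:00", tz = "UTC"), cand_b)

print(res_d)
print(res_b)
```

The helper could be applied per row of df1 against the df2 dates of the same ID, instead of indexing with which() on the time differences.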

[Comments]:

    [Solution 2]:

    A possible data.table solution:

    df1[, PrecedingDate := df2[df1
                               , on = .(ID, Date <= Date)
                               , .(ID, Date = i.Date, pd = x.Date)
                               ][, .SD[.N], by = .(ID, Date)
                                 ][shift(pd) == pd, pd := NA][, pd]
        ][, FollowingDate := df2[df1
                                 , on = .(ID, Date >= Date)
                                 , .(ID, Date = i.Date, fd = x.Date)
                                 ][, .SD[1], by = .(ID, Date)][, fd]][]
    

    which gives:

    > df1
       ID                Date       PrecedingDate       FollowingDate
    1:  A 2018-04-12 08:56:00 2018-04-11 18:59:00 2018-04-12 12:37:00
    2:  A 2018-04-13 11:03:00 2018-04-12 12:37:00                <NA>
    3:  B 2018-04-14 14:30:00                <NA> 2018-04-15 01:43:00
    4:  B 2018-04-15 03:10:00 2018-04-15 01:43:00                <NA>
    5:  C 2018-04-16 07:28:00                <NA> 2018-04-21 09:52:00
    6:  D 2018-04-17 11:17:00 2018-04-15 20:25:00 2018-04-17 12:33:00
    7:  D 2018-04-17 14:21:00 2018-04-17 14:21:00 2018-04-17 14:21:00
    8:  D 2018-04-18 09:56:00                <NA> 2018-04-18 10:59:00
    9:  E 2018-05-02 07:49:00 2018-05-01 09:50:00                <NA>
    

    This equals the desired result:

    > all.equal(df1, as.data.table(result))
    [1] TRUE
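
The chained expression above is dense, so here is a rough sketch of its first half (the preceding date) unpacked into two steps on a cut-down table for ID A. The names d1, d2, long and last_before are hypothetical, and tz = "UTC" is assumed. Instead of `.SD[.N]`, this variant takes max(pd) per group, which does not rely on the row order of the join result; it leaves out the `shift(pd) == pd` de-duplication step.

```r
library(data.table)

# Cut-down illustrative tables (ID "A" only; names are hypothetical)
d1 <- data.table(
  ID   = c("A", "A"),
  Date = as.POSIXct(c("2018-04-12 08:56:00", "2018-04-13 11:03:00"), tz = "UTC")
)
d2 <- data.table(
  ID   = c("A", "A", "A"),
  Date = as.POSIXct(c("2018-04-10 07:11:00", "2018-04-11 18:59:00",
                      "2018-04-12 12:37:00"), tz = "UTC")
)

# Step 1: non-equi join. For each d1 row, keep every d2 row with the same ID
# and a date on or before the d1 date. i.Date is d1's date, x.Date is d2's.
long <- d2[d1, on = .(ID, Date <= Date), .(ID, Date = i.Date, pd = x.Date)]
print(long)

# Step 2: collapse to one row per d1 row by taking the latest candidate
last_before <- long[, .(pd = max(pd)), by = .(ID, Date)]
print(last_before)
```

The following date works the same way with `Date >= Date` in the join and the earliest candidate per group.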
    

    Data used:

    df1 <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
                      Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")))
    df2 <- data.frame(ID = c("A","A","A","B","C","D","D","D","D","D","E"),
                      Date = as.POSIXct(c("2018-04-10 07:11:00","2018-04-11 18:59:00","2018-04-12 12:37:00","2018-04-15 01:43:00","2018-04-21 09:52:00","2018-04-15 20:25:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00","2018-04-20 14:11:00","2018-05-01 09:50:00")))
    result <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
                         Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")),
                         PrecedingDate = as.POSIXct(c("2018-04-11 18:59:00","2018-04-12 12:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-15 20:25:00","2018-04-17 14:21:00",NA,"2018-05-01 09:50:00")),
                         FollowingDate = as.POSIXct(c("2018-04-12 12:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-21 09:52:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00",NA)))
    

    [Comments]:
