【Question title】: Find the shortest time difference between two data frames
【Posted】: 2016-05-13 10:22:39
【Question】:

Suppose I have two data frames:

df1

id        time1
1         2016-04-07 21:39:10
1         2016-04-05 11:19:17
2         2016-04-03 10:58:25
2         2016-04-02 21:39:10

df2

id        time2
1         2016-04-07 21:39:11
1         2016-04-05 11:19:18
1         2016-04-06 21:39:11
1         2016-04-04 11:19:18
2         2016-04-03 10:58:26
2         2016-04-02 21:39:11
2         2016-04-04 10:58:26
2         2016-04-05 21:39:11

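For each entry in df1, I want to find the shortest time difference in df2. Take the first entry: its id is 1, so I want to filter df2 to id 1, compute the time difference between that df1 entry and each of the matching df2 entries, find the smallest difference, and fetch the corresponding entry. My expected output would be: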

id        time1                  time2                    diff (in secs)
1         2016-04-07 21:39:10    2016-04-07 21:39:11        1
1         2016-04-05 11:19:17    2016-04-05 11:19:18        1
2         2016-04-03 10:58:25    2016-04-03 10:58:26        1
2         2016-04-02 21:39:10    2016-04-02 21:39:11        1

Here is my attempt:

for(i in unique(df1$id)){
  temp1 = df1[df1$id == i,]
  temp2 = df2[df2$id == i,]
  for(j in unique(df1$time1){
     for(k in unique(df2$time2){
        diff = abs(df1$time1[j] - df2$time2[k]
        print(diff)}}}

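I can't get any further than this, and I'm getting a lot of errors. Can someone help me fix this, and perhaps suggest a more efficient way to do it? Any help would be appreciated.

Update:

Reproducible data: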

    df1 <- data.frame(
        id = c(1,1,2,2),
        time1 = c('2016-04-07 21:39:10', '2016-04-05 11:19:17', '2016-04-03 10:58:25', '2016-04-02 21:39:10')
    )

    df2 <- data.frame(
        id = c(1,1,1,1,2,2,2,2),
        time2 = c('2016-04-07 21:39:11', '2016-04-05 11:19:18','2016-04-07 21:39:11', '2016-04-05 11:19:18', '2016-04-03 10:58:26', '2016-04-02 21:39:11','2016-04-03 10:58:26', '2016-04-02 21:39:11')
    )

df1$time1 =  as.POSIXct(df1$time1)
df2$time2 = as.POSIXct(df2$time2)

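【Comments】:

  • Could you add the code that generates df1 and df2?
  • Do the ids matter? It sounds like you want the shortest difference within each id.
  • @jaimedash Yes, along with the corresponding time.
  • @Divi Will do.
  • Please provide the data using dput.

Tags: r dataframe greatest-n-per-group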


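【Solution 1】:

You can do this with dplyr. Since we want exactly one output row per entry of df1, the idea is to first give each row of df1 its own identifier (here I simply call it rowname).

After that, we join the two data frames on id and, for each rowname, keep only the match with the smallest absolute time difference.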

library(dplyr)

df1$time1 <- as.POSIXct(as.character(df1$time1))
df2$time2 <- as.POSIXct(as.character(df2$time2))

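df1 %>% 
  add_rownames("rowname") %>%              # tag each df1 row so it can be grouped later
  left_join(df2, "id") %>%                 # pair each df1 row with every df2 row sharing its id
  mutate(diff = time2 - time1) %>%
  group_by(rowname) %>%
  filter(min(abs(diff)) == abs(diff)) %>%  # keep the closest match(es) per df1 row
  distinct()                               # drop duplicated matches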

This is my output:

Source: local data frame [4 x 5]
Groups: rowname [4]

  rowname    id               time1               time2   diff
    (chr) (dbl)              (time)              (time) (dfft)
1       1     1 2016-04-07 21:39:10 2016-04-07 21:39:11 1 secs
2       2     1 2016-04-05 11:19:17 2016-04-05 11:19:18 1 secs
3       3     2 2016-04-03 10:58:25 2016-04-03 10:58:26 1 secs
4       4     2 2016-04-02 21:39:10 2016-04-02 21:39:11 1 secs      
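
For readers on a recent dplyr, the same idea (tag each df1 row, join on id, keep the closest match per row) can be sketched as follows. This is only an illustration, not the answer's original code: it assumes dplyr 1.0.0 or later for `slice_min()`, uses `row_number()` in place of the deprecated `add_rownames()`, rebuilds the data from the tables printed in the question, and fixes the time zone to UTC purely for reproducibility. The column names `row` and `diff_secs` are made up for this sketch.

```r
library(dplyr)

# Data as printed in the question's tables; tz = "UTC" is an assumption for reproducibility
df1 <- data.frame(
  id = c(1, 1, 2, 2),
  time1 = as.POSIXct(c("2016-04-07 21:39:10", "2016-04-05 11:19:17",
                       "2016-04-03 10:58:25", "2016-04-02 21:39:10"), tz = "UTC")
)
df2 <- data.frame(
  id = rep(c(1, 2), each = 4),
  time2 = as.POSIXct(c("2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-06 21:39:11", "2016-04-04 11:19:18",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11",
                       "2016-04-04 10:58:26", "2016-04-05 21:39:11"), tz = "UTC")
)

result <- df1 %>%
  mutate(row = row_number()) %>%   # one identifier per df1 row
  left_join(df2, by = "id") %>%    # newer dplyr may warn about a many-to-many join; expected here
  mutate(diff_secs = abs(as.numeric(difftime(time2, time1, units = "secs")))) %>%
  group_by(row) %>%
  slice_min(diff_secs, n = 1, with_ties = FALSE) %>%  # exactly one closest match per df1 row
  ungroup()

print(result)
```

Computing the difference with `difftime(..., units = "secs")` keeps the unit fixed; a plain `time2 - time1` lets R pick the unit automatically, which can make values harder to compare.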

【Comments】:

    【Solution 2】:

    You can also do this in base R. To generate random dates (handy for testing), I borrowed and edited a nice function from elsewhere on StackOverflow:

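    # Draw N random timestamps, sorted, between the start date st and end date et
    latemail <- function(N, st="2011/01/01", et="2016/12/31") {
      st <- as.POSIXct(as.Date(st))
      et <- as.POSIXct(as.Date(et))
      dt <- as.numeric(difftime(et, st, units="secs"))  # length of the window in seconds
      ev <- sort(runif(N, 0, dt))                       # random offsets into the window
      return(st + ev)
    }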
    df1 <- data.frame(id=c(1,1,2,2), time1=latemail(4))
    df2 <- data.frame(id=c(rep(1,4), rep(2,4)), time2=latemail(8))
    

    Your answer can then be obtained in two lines:

    shortest <- sapply(df1$time1, function(x) which(abs(df2$time2 - x) == min(abs(df2$time2 - x))))
    cbind(df1, df2[shortest,])
    

    Output:

    id               time1 id               time2
     1 2011-10-08 02:00:21  1 2011-08-17 18:07:47
     1 2012-05-06 17:49:03  1 2012-09-04 19:52:40
     2 2013-10-29 13:14:51  1 2012-10-29 20:09:31
     2 2016-06-17 19:23:43  2 2015-11-24 02:07:15
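
Note that the two-liner above searches all of df2 regardless of id, which is why the third output row pairs an id 2 entry with an id 1 entry. If the match has to stay within the same id, as the question asks, a base R variant could look like the sketch below. It is only an illustration under stated assumptions: `nearest_row()` is a hypothetical helper written for this example, the data is rebuilt from the tables printed in the question, and the time zone is fixed to UTC for reproducibility.

```r
# Data as printed in the question's tables; tz = "UTC" is an assumption for reproducibility
df1 <- data.frame(
  id = c(1, 1, 2, 2),
  time1 = as.POSIXct(c("2016-04-07 21:39:10", "2016-04-05 11:19:17",
                       "2016-04-03 10:58:25", "2016-04-02 21:39:10"), tz = "UTC")
)
df2 <- data.frame(
  id = rep(c(1, 2), each = 4),
  time2 = as.POSIXct(c("2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-06 21:39:11", "2016-04-04 11:19:18",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11",
                       "2016-04-04 10:58:26", "2016-04-05 21:39:11"), tz = "UTC")
)

# Hypothetical helper: row index in `candidates` of the time2 closest to `t`,
# considering only rows with the given id (NA if the id has no candidates)
nearest_row <- function(id, t, candidates) {
  idx <- which(candidates$id == id)
  if (length(idx) == 0) return(NA_integer_)
  secs <- abs(as.numeric(difftime(candidates$time2[idx], t, units = "secs")))
  idx[which.min(secs)]  # which.min() returns the first minimum, so ties yield one row
}

match_idx <- vapply(seq_len(nrow(df1)),
                    function(i) nearest_row(df1$id[i], df1$time1[i], df2),
                    integer(1))

result <- data.frame(
  id = df1$id,
  time1 = df1$time1,
  time2 = df2$time2[match_idx],
  diff_secs = abs(as.numeric(difftime(df2$time2[match_idx], df1$time1, units = "secs")))
)
print(result)
```

Using `which.min()` instead of `which(x == min(x))` guarantees one index per df1 row, so `vapply()` always gets a single integer even when two df2 entries are equally close.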
    

    【Comments】:

      【Solution 3】:

      If you use data.table:

      library(data.table)
      df1 <- data.table(
        id = c(1,1,2,2),
        time1 = c('2016-04-07 21:39:10', '2016-04-05 11:19:17', '2016-04-03 10:58:25', '2016-04-02 21:39:10')
      )
      
      df2 <- data.table(
        id = c(1,1,1,1,2,2,2,2),
        time2 = c('2016-04-07 21:39:11', '2016-04-05 11:19:18','2016-04-07 21:39:11', '2016-04-05 11:19:18', '2016-04-03 10:58:26', '2016-04-02 21:39:11','2016-04-03 10:58:26', '2016-04-02 21:39:11')
      )
      
      df1$time1 =  as.POSIXct(df1$time1)
      df2$time2 = as.POSIXct(df2$time2)
      
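      # Join on id to get every (time1, time2) pair per id, then compute the absolute difference
      res <- df1[df2, .(time1, time2), by  = .EACHI, on = "id"][, diff:= abs(time2 -time1)]
      # Sort so that the smallest difference comes first within each df1 entry
      setkey(res, id, time1, diff)
      # Number the rows within each (id, time1) group and keep the first, i.e. the closest match
      res <- res[, row := seq_along(.I), by = .(id, time1)][row == 1]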
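
As an alternative that avoids building every (time1, time2) pair, data.table's rolling join with `roll = "nearest"` can pick the closest time2 within each id directly. The following is a sketch under stated assumptions rather than the answer's own method: the data is rebuilt from the tables printed in the question, the time zone is fixed to UTC for reproducibility, and the names `time2_matched` and `diff_secs` are made up for this example.

```r
library(data.table)

# Data as printed in the question's tables; tz = "UTC" is an assumption for reproducibility
df1 <- data.table(
  id = c(1, 1, 2, 2),
  time1 = as.POSIXct(c("2016-04-07 21:39:10", "2016-04-05 11:19:17",
                       "2016-04-03 10:58:25", "2016-04-02 21:39:10"), tz = "UTC")
)
df2 <- data.table(
  id = rep(c(1, 2), each = 4),
  time2 = as.POSIXct(c("2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-06 21:39:11", "2016-04-04 11:19:18",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11",
                       "2016-04-04 10:58:26", "2016-04-05 21:39:11"), tz = "UTC")
)

# Keep a copy of time2: in the join result the join column carries df1's time1 values
df2[, time2_matched := time2]

# For each df1 row, match id exactly and roll time1 to the nearest time2
res <- df2[df1, on = .(id, time2 = time1), roll = "nearest"]

# Restore meaningful column names, then compute the difference in seconds
setnames(res, "time2", "time1")
setnames(res, "time2_matched", "time2")
res[, diff_secs := abs(as.numeric(difftime(time2, time1, units = "secs")))]
print(res)
```

The rolling join only applies to the last column listed in `on`, so id is matched exactly while the time is rolled to its nearest neighbour.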
      

      【Comments】:
