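【Question title】:How to merge time frame data, leaving NA for non-overlapping parts?
【Posted】:2015-06-26 13:19:51
【Question description】:

I have two datasets (df1 and df2), both of which contain date-time values. I want to produce something like the "objective output" shown below: when merging the two datasets by c("id1","id2"), I want to leave NA wherever the times do not overlap.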

df1

id1    id2     click_timing 
 1      11     2015-02-03 01:00:00     
 1      11     2015-02-03 02:00:00     
 1      12     2015-02-03 03:00:00     
 1      12     2015-02-03 04:00:00     
 1      13     2015-02-03 05:10:00     
 2      34     2015-02-03 03:00:00     
 2      34     2015-02-03 04:00:00     
 2      36     2015-02-03 01:00:00
 ...     

df2

id1    id2     start                         end
 1      11     2015-02-03 00:20:00     2015-02-03 00:40:00
 1      11     2015-02-03 00:50:00     2015-02-03 01:20:00
 1      13     2015-02-03 01:10:00     2015-02-03 01:40:00     
 1      13     2015-02-03 04:50:00     2015-02-03 05:30:00     
 2      34     2015-02-03 03:50:00     2015-02-03 04:10:00     
 ...

Objective output

id1    id2     click_timing                start                 end
 1      11             NA             2015-02-03 00:20:00     2015-02-03 00:40:00
 1      11     2015-02-03 01:00:00    2015-02-03 00:50:00     2015-02-03 01:20:00
 1      11     2015-02-03 02:00:00            NA                  NA
 1      12     2015-02-03 03:00:00            NA                  NA
 1      12     2015-02-03 04:00:00            NA                  NA
 1      13             NA             2015-02-03 01:10:00     2015-02-03 01:40:00     
 1      13     2015-02-03 05:10:00    2015-02-03 04:50:00     2015-02-03 05:30:00
 2      34     2015-02-03 03:00:00            NA                  NA     
 2      34     2015-02-03 04:00:00     2015-02-03 03:50:00     2015-02-03 04:10:00
 2      36     2015-02-03 01:00:00            NA                  NA
 ...     

【Comments】:

  • I have already tried merge(df1, df2, by=c("id1","id2")) with all.x=T and all.y=T. I don't know why it doesn't work, but I want NA to be left for the values that don't match.
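A minimal sketch of why the plain merge in the comment above is not enough, using only the (1, 11) rows from the question. merge() matches on id1 and id2 alone, so every click is paired with every interval that shares those ids, and no time comparison happens. The names `clicks`, `intervals`, `plain` and `in_range` are hypothetical, and the `tz = "UTC"` argument is an assumption added to keep the example reproducible.

```r
# Rows for id1 = 1, id2 = 11 taken from the question (tz is an assumption).
clicks <- data.frame(
  id1 = c(1, 1), id2 = c(11, 11),
  click_timing = as.POSIXct(c("2015-02-03 01:00:00", "2015-02-03 02:00:00"), tz = "UTC")
)
intervals <- data.frame(
  id1 = c(1, 1), id2 = c(11, 11),
  start = as.POSIXct(c("2015-02-03 00:20:00", "2015-02-03 00:50:00"), tz = "UTC"),
  end   = as.POSIXct(c("2015-02-03 00:40:00", "2015-02-03 01:20:00"), tz = "UTC")
)

# Joining on the ids only: 2 clicks x 2 intervals = 4 rows, no NA anywhere.
plain <- merge(clicks, intervals, by = c("id1", "id2"), all = TRUE)
print(nrow(plain))

# Only one of those four pairings is a real time overlap.
in_range <- plain$click_timing >= plain$start & plain$click_timing <= plain$end
print(sum(in_range))
```

The time condition therefore has to enter the join itself, which is what both solutions below do in different ways.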

Tags: r merge dataset


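【Solution 1】:

Tough problem! I think you have to loop over all click_timing values manually, compute the intersection between each click_timing value and every time period (start/end), and then use the resulting index match as an additional join field: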

df1 <- data.frame(id1=c(1,1,1,1,1,2,2,2), id2=c(11,11,12,12,13,34,34,36), click_timing=as.POSIXct(c('2015-02-03 01:00:00','2015-02-03 02:00:00','2015-02-03 03:00:00','2015-02-03 04:00:00','2015-02-03 05:10:00','2015-02-03 03:00:00','2015-02-03 04:00:00','2015-02-03 01:00:00')) );
df2 <- data.frame(id1=c(1,1,1,1,2), id2=c(11,11,13,13,34), start=as.POSIXct(c('2015-02-03 00:20:00','2015-02-03 00:50:00','2015-02-03 01:10:00','2015-02-03 04:50:00','2015-02-03 03:50:00')), end=as.POSIXct(c('2015-02-03 00:40:00','2015-02-03 01:20:00','2015-02-03 01:40:00','2015-02-03 05:30:00','2015-02-03 04:10:00')) );
m <- sapply(1:nrow(df1), function(i) which(df1$id1[i]==df2$id1 & df1$id2[i] == df2$id2 & df1$click_timing[i]>=df2$start & df1$click_timing[i]<=df2$end)[1] );
merge(cbind(df1,m=m),cbind(df2,m=1:nrow(df2)),by=c('id1','id2','m'),all=T)[-3];
##    id1 id2        click_timing               start                 end
## 1    1  11                <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00
## 2    1  11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00
## 3    1  11 2015-02-03 02:00:00                <NA>                <NA>
## 4    1  12 2015-02-03 04:00:00                <NA>                <NA>
## 5    1  12 2015-02-03 03:00:00                <NA>                <NA>
## 6    1  13                <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00
## 7    1  13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00
## 8    2  34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00
## 9    2  34 2015-02-03 03:00:00                <NA>                <NA>
## 10   2  36 2015-02-03 01:00:00                <NA>                <NA>

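If a single click_timing value intersects multiple start/end pairs, this solution picks the one that occurs earliest in df2 (i.e. the one with the lowest row index) over the other matches.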
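The first-match behavior comes from the `which(...)[1]` idiom, which can be sketched in isolation. The toy data below is hypothetical (two deliberately overlapping intervals, which the question's data does not contain), and `tz = "UTC"` is an assumption for reproducibility.

```r
# Hypothetical toy data: two overlapping intervals and two clicks.
intervals <- data.frame(
  start = as.POSIXct(c("2015-02-03 00:50:00", "2015-02-03 00:55:00"), tz = "UTC"),
  end   = as.POSIXct(c("2015-02-03 01:20:00", "2015-02-03 01:30:00"), tz = "UTC")
)
clicks <- as.POSIXct(c("2015-02-03 01:00:00", "2015-02-03 02:00:00"), tz = "UTC")

first_match <- sapply(seq_along(clicks), function(i) {
  # Row indices of every interval that contains this click.
  hits <- which(clicks[i] >= intervals$start & clicks[i] <= intervals$end)
  # Indexing with [1] keeps the lowest row index, or yields NA when hits is empty.
  hits[1]
})
print(first_match)
```

The first click lies inside both intervals and is assigned to row 1; the second lies inside none and gets NA, which is what later makes the merge leave start and end empty for it.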

【Comments】:

    【Solution 2】:

    Recreating the initial data frames and doing some minor preparation:

    library(data.table)
    library(lubridate)
    
    df1<- fread("id1,id2,click_timing
    1,11,2015-02-03 01:00:00
    1,11,2015-02-03 02:00:00
    1,12,2015-02-03 03:00:00
    1,12,2015-02-03 04:00:00
    1,13,2015-02-03 05:10:00
    2,34,2015-02-03 03:00:00
    2,34,2015-02-03 04:00:00
    2,36,2015-02-03 01:00:00")
    
    # adding a redundant click_timing2 column to use as the end range for further foverlaps() function
    df1[, click_timing2:= click_timing]
    df1[,c("click_timing", "click_timing2"):= list(parse_date_time(click_timing, "%Y-%m-%d %T"), parse_date_time(click_timing2, "%Y-%m-%d %T"))]
    
    
    df2<- fread("id1,id2,start,end
    1,11,2015-02-03 00:20:00,2015-02-03 00:40:00
    1,11,2015-02-03 00:50:00,2015-02-03 01:20:00
    1,13,2015-02-03 01:10:00,2015-02-03 01:40:00
    1,13,2015-02-03 04:50:00,2015-02-03 05:30:00
    2,34,2015-02-03 03:50:00,2015-02-03 04:10:00")
    
    df2[,c("start","end") := list(parse_date_time(start, "%Y-%m-%d %T"), parse_date_time(end, "%Y-%m-%d %T"))]
    setkey(df2, id1, id2, start, end)
    

    Solution:

    df3<- foverlaps(df1, df2, by.x=c("id1", "id2", "click_timing", "click_timing2"), 
                              by.y = c("id1", "id2", "start", "end"), type="within")
    objective_output<- merge(df3, df2, by = c("id1", "id2", "start", "end"), all = T)
    # deleting redundant click_timing2 column
    objective_output[,click_timing2:= NULL]
    # reordering columns
    setcolorder(objective_output, c(1,2,5,3,4))
    #setting key using all columns and thus reordering all rows
    setkey(objective_output)
    objective_output
    #id1 id2        click_timing               start                 end
    # 1:   1  11 2015-02-03 02:00:00                <NA>                <NA>
    # 2:   1  11                <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00
    # 3:   1  11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00
    # 4:   1  12 2015-02-03 03:00:00                <NA>                <NA>
    # 5:   1  12 2015-02-03 04:00:00                <NA>                <NA>
    # 6:   1  13                <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00
    # 7:   1  13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00
    # 8:   2  34 2015-02-03 03:00:00                <NA>                <NA>
    # 9:   2  34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00
    #10:   2  36 2015-02-03 01:00:00                <NA>                <NA>
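A reduced sketch of the foverlaps() step, to show why the duplicated click_timing2 column is needed: foverlaps() joins intervals to intervals, so each click is turned into a zero-width interval [t, t2], and type = "within" keeps it only when it lies inside a start/end range. The example uses plain numbers instead of date-times, and the names `clicks`, `intervals`, `t`, `t2` and `res` are hypothetical.

```r
library(data.table)

# Hypothetical numeric stand-ins for the timestamps.
clicks <- data.table(id = c(1, 1), t = c(5, 20))
clicks[, t2 := t]                      # zero-width interval [t, t2]

intervals <- data.table(id = c(1, 1), start = c(0, 30), end = c(10, 40))
setkey(intervals, id, start, end)      # foverlaps() requires y to be keyed

res <- foverlaps(clicks, intervals,
                 by.x = c("id", "t", "t2"),
                 by.y = c("id", "start", "end"),
                 type = "within")
print(res)
```

The click at 5 is matched to the range 0 to 10, while the click at 20 is kept with NA for start and end. The range 30 to 40, which contains no click, does not appear in `res` at all, which is why the solution above follows foverlaps() with merge(..., all = T) against df2.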
    

    【Comments】:
