【Title】: Join two data frames by nearest match
【Posted】: 2021-09-15 15:53:57
【Question】:

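I have two large datasets whose only shared feature is a numeric timestamp. I want to merge the data frames by this timestamp, but the two were not collected at exactly matching frequencies, so each row has to be merged with its nearest possible match.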

As a simplified example, here is a small dataset with a value column, some events, and an ID:

library(data.table)

a <- c("150", "164", "175", "183", "195", "200", "205", "213")
b <- c("start1", "end1", "start2", "end2", "start1", "end1", "start2", "end2")
c <- c("A", "A", "A", "A", "B", "B", "B", "B")

(data <- data.table(value = a, event = b, ID = c))

I would like to merge this "data" with the following numeric series ("times") by the value column:

(times<-data.frame(value = c(seq(from = 150, to = 213, by = 3))))

so that the rows are merged on the closest approximate match in the value column, producing this final data frame:

agoal<-c(seq(from = 150, to = 213, by = 3))
bgoal<-c("start1","","","","","end1","", "",
     "start2", "", "", "end2", "", "", "",
     "start1", "", "end1", "start2", "", "", "end2")
cgoal<-c("A","","","","","A","", "",
         "A", "", "", "A", "", "", "",
         "B", "", "B", "B", "", "", "B")

(goal<-data.frame(value = agoal, event = bgoal, ID = cgoal))

Is there a way to do this, especially for a very large dataset (so that it doesn't crash R)?

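【Comments】:

  • Take a look at the "fuzzyjoin" package.
  • Thanks! That seems to work (at least for the example data): end
  • data.table offers rolling joins. One pointer: stackoverflow.com/questions/35046161/… For example, data[times, roll = "nearest"] (you need to setkey on value first).

Tags: r merge match numeric approximate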


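【Solution 1】:

To join on the closest match without filling the gaps with approximate matches, fuzzyjoin works well! Note that value must be numeric in both tables; the example in the question stores it as character, so convert it with as.numeric() first.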

(end<-fuzzyjoin::difference_left_join(times, data, by = "value", max_dist = 1, distance_col= "distance"))
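
For readers who cannot install fuzzyjoin, the same idea (snap each event to its nearest time point, and keep it only if it lies within a tolerance) can be sketched in base R. This is a minimal sketch, not the fuzzyjoin implementation: it assumes times$value is sorted in increasing order, and the names grid, lo, hi, nearest, ok, and result are made up for illustration.

```r
# Example data from the question, with value stored as numeric
data <- data.frame(
  value = c(150, 164, 175, 183, 195, 200, 205, 213),
  event = c("start1", "end1", "start2", "end2", "start1", "end1", "start2", "end2"),
  ID    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  stringsAsFactors = FALSE
)
times <- data.frame(value = seq(from = 150, to = 213, by = 3))

max_dist <- 1          # same tolerance as in the fuzzyjoin call
grid <- times$value    # assumed sorted in increasing order

# Grid point at or below each event timestamp, and its right-hand neighbour
lo <- pmax(findInterval(data$value, grid), 1)
hi <- pmin(lo + 1, length(grid))

# Choose whichever neighbour is closer (ties go to the lower one)
nearest <- ifelse(abs(grid[hi] - data$value) < abs(data$value - grid[lo]), hi, lo)

# Keep only matches within the tolerance
ok <- abs(grid[nearest] - data$value) <= max_dist

# Start from blank columns and fill in the matched rows
result <- data.frame(value = grid, event = "", ID = "", stringsAsFactors = FALSE)
result$event[nearest[ok]] <- data$event[ok]
result$ID[nearest[ok]]    <- data$ID[ok]
```

Because findInterval is vectorised and relies on the sorted grid, this avoids comparing every event against every time point, which matters for large inputs. If two events snap to the same time point, the later one overwrites the earlier one in this sketch.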

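【Comments】:

    【Solution 2】:

    data.table provides a rolling join solution. Note that this assigns every row of times its nearest event, rather than leaving unmatched rows blank as in the question's goal.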

    library(data.table)
    setkey(data,value)
    setkey(times,value)
    data[times,roll = "nearest"]
    #    value  event ID
    # 1:   150 start1  A
    # 2:   153 start1  A
    # 3:   156 start1  A
    # 4:   159   end1  A
    # 5:   162   end1  A
    # 6:   165   end1  A
    # 7:   168   end1  A
    # 8:   171 start2  A
    # 9:   174 start2  A
    #10:   177 start2  A
    #11:   180   end2  A
    #12:   183   end2  A
    #13:   186   end2  A
    #14:   189   end2  A
    #15:   192 start1  B
    #16:   195 start1  B
    #17:   198   end1  B
    #18:   201   end1  B
    #19:   204 start2  B
    #20:   207 start2  B
    #21:   210   end2  B
    #22:   213   end2  B
    

    Data:

    a<-c("150", "164", "175", "183", "195", "200", "205","213")
    b<-c("start1","end1","start2", "end2", "start1", "end1", "start2", "end2")
    c<-c("A","A","A", "A", "B", "B", "B", "B")
    
    data<-data.table(value = as.numeric(a), event = b, ID = c)
    
    times<-data.table(value = c(seq(from = 150, to = 213, by = 3)))
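
The rolling join above gives every row of times an event. If blank rows are wanted, as in the question's goal, one possible variation is to roll in the other direction: snap each row of data to its nearest time point, then left join the result back onto times. The following is a sketch under that assumption; the helper names lookup, grid, snapped, and result are hypothetical.

```r
library(data.table)

data <- data.table(
  value = c(150, 164, 175, 183, 195, 200, 205, 213),
  event = c("start1", "end1", "start2", "end2", "start1", "end1", "start2", "end2"),
  ID    = c("A", "A", "A", "A", "B", "B", "B", "B")
)
times <- data.table(value = seq(from = 150, to = 213, by = 3))

# Copy of the grid with an extra column that survives the join,
# because the join column itself takes the values from data
lookup <- times[, .(value, grid = value)]

# For each row of data, pick the nearest row of lookup
snapped <- lookup[data, on = "value", roll = "nearest"]

# Left join back onto the full grid; unmatched time points get NA
result <- merge(times, snapped[, .(value = grid, event, ID)],
                by = "value", all.x = TRUE)
```

Unmatched rows come back as NA rather than empty strings. This sketch applies no distance limit; filtering snapped on abs(value - grid) before the merge would add one. If two events snap to the same time point, that time point appears twice in result.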
    

    【Comments】:
