Joining two data frames based on a time-range condition in R: answers

【Question title】: Joining two dataframes based on a timeframe condition, in R
【Posted】: 2021-10-25 04:21:16
【Question】:

I have a df:

df1 <- data.frame(date = c("2020-01-01", "2018-01-01"), A = c("5", NA), B = c("4", "0"))


       date    A B
 2020-01-01    5 4
 2018-01-01 <NA> 0

And a second df:

df2 <- data.frame(date = c("2020-05-16", "2018-09-23", "2017-02-02"), C = c("2", "3", "4"), D = c("9", "10", "11"))

       date C  D
 2020-05-16 2  9
 2018-09-23 3 10
 2017-02-02 4 11

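I want to join the two dfs so that only the dates in df2 that fall after a date in df1, and within 12 months of it, are joined to df1 (while keeping df1's date).

The result of this join should look like this: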

df_result <- data.frame(date = c("2020-01-01", "2018-01-01"), A = c("5", NA), B = c("4", "0"), C = c("2", "3"), D = c("9", "10"))

       date    A B C  D
 2020-01-01    5 4 2  9
 2018-01-01 <NA> 0 3 10

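If the dates were equal, this would be easy with inner_join. However, I am not sure how to do an inner join with a condition that is more than just x = y.

Any help would be much appreciated, thanks!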

【Question comments】:

Tags: r dataframe date dplyr

【Solution 1】:

Convert the dates to Date class, then use the left join shown below.

library(sqldf)

df1 <- data.frame(date = c("2020-01-01", "2018-01-01"), 
  A = c("5", NA), B = c("4", "0"))
df2 <- data.frame(date = c("2020-05-16", "2018-09-23", "2017-02-02"), 
  C = c("2", "3", "4"), D = c("9", "10", "11"))

df1$date <- as.Date(df1$date)
df2$date <- as.Date(df2$date)
   
sqldf("select a.*, b.C, b.D
  from df1 a
  left join df2 b on b.date > a.date and b.date - a.date <= 365")
##         date    A B C  D
## 1 2020-01-01    5 4 2  9
## 2 2018-01-01 <NA> 0 3 10
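
The join condition works because subtracting two Date values gives a difference in days. A minimal base R sketch of that arithmetic follows; the variable names are only illustrative. It also shows that a fixed 365-day window is not quite the same as 12 calendar months when a leap day is involved, which may or may not matter for your data.

```r
# Differences between Date objects are measured in days.
d1 <- as.Date(c("2020-01-01", "2018-01-01"))
d2 <- as.Date(c("2020-05-16", "2018-09-23"))
gap_days <- as.numeric(d2 - d1)
gap_days
# 136 265  (both are > 0 and <= 365, so both pairs satisfy the join condition)

# A 365-day window versus a calendar 12-month window (2020 is a leap year).
start   <- as.Date("2020-01-01")
end_365 <- start + 365
end_12m <- seq(start, by = "12 months", length.out = 2)[2]
end_365  # "2020-12-31"
end_12m  # "2021-01-01"
```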

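In the example data each row of df1 has exactly one match, but if there could be several matches and we only want the closest one (the smallest date difference):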

sqldf("select a.*, b.C, b.D, min(b.date - a.date) date_diff
  from df1 a
  left join df2 b on b.date > a.date and b.date - a.date <= 365
  group by a.rowid
  order by a.rowid")[-6]
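
For readers who want to see the "closest match per df1 row" idea without SQL, here is a rough base R sketch. It is not the method used in this answer, just the same logic spelled out. The helper name closest_following() is hypothetical, and df2 has been altered for illustration so that two of its rows fall within 365 days of 2020-01-01.

```r
df1 <- data.frame(date = as.Date(c("2020-01-01", "2018-01-01")),
                  A = c("5", NA), B = c("4", "0"))
# Illustrative df2 (not the original): two candidates for 2020-01-01.
df2 <- data.frame(date = as.Date(c("2020-04-05", "2020-05-06", "2018-09-23")),
                  C = c("2", "7", "3"), D = c("9", "8", "10"))

# Hypothetical helper: for one date d, return the index of the closest
# later candidate within max_days, or NA if there is none.
closest_following <- function(d, candidates, max_days = 365) {
  gap <- as.numeric(candidates - d)
  ok <- which(gap > 0 & gap <= max_days)
  if (length(ok) == 0) return(NA_integer_)
  ok[which.min(gap[ok])]
}

idx <- vapply(seq_len(nrow(df1)),
              function(i) closest_following(df1$date[i], df2$date),
              integer(1))

# An NA index yields a row of NAs, which mimics a left join.
result <- cbind(df1, df2[idx, c("C", "D")])
rownames(result) <- NULL
result
```

Here 2020-01-01 picks up the 2020-04-05 row rather than the 2020-05-06 row, because it is the nearer of the two.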

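Added

Regarding the comment below: join df1 and df2 on the same condition, but for each df2 row in the join keep only the df1 row closest to it, giving mm. Then left join df1 to mm so that every row of df1 is represented.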

library(sqldf)

# added third row to df1 as per comment
df1 <- data.frame(date = c("2020-01-01", "2018-01-01", "2020-02-02"), 
  A = c(5, NA, 1), B = c(4, 0, 1))
df2 <- data.frame(date = c("2020-05-16", "2018-09-23", "2017-02-02"), 
  C = c(2, 3, 4), D = c(9, 10, 11))
df1$date <- as.Date(df1$date)
df2$date <- as.Date(df2$date)

mm <- sqldf("select a.*, b.C, b.D, min(b.date - a.date) date_diff
  from df1 a
  left join df2 b on b.date > a.date and b.date - a.date <= 365
  group by b.rowid")
sqldf("select a.*, b.C, b.D
  from df1 a
  left join mm b using(date)")
##         date  A B  C  D
## 1 2020-01-01  5 4 NA NA
## 2 2018-01-01 NA 0  3 10
## 3 2020-02-02  1 1  2  9
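
The same two steps (assign each df2 row to its nearest qualifying df1 row, then attach the result back onto df1) can be sketched in base R with a matrix of day gaps. This is only an illustration under the assumptions stated in the comments; the names gap, owner and match_idx are made up for the example.

```r
df1 <- data.frame(date = as.Date(c("2020-01-01", "2018-01-01", "2020-02-02")),
                  A = c(5, NA, 1), B = c(4, 0, 1))
df2 <- data.frame(date = as.Date(c("2020-05-16", "2018-09-23", "2017-02-02")),
                  C = c(2, 3, 4), D = c(9, 10, 11))

# gap[i, j] = days from df1 row i to df2 row j.
gap <- outer(as.numeric(df1$date), as.numeric(df2$date), function(a, b) b - a)
gap[gap <= 0 | gap > 365] <- NA  # outside the "after and within 365 days" window

# For each df2 row (column), the df1 row with the smallest gap, or NA if none.
owner <- apply(gap, 2, function(g) {
  if (all(is.na(g))) NA_integer_ else which.min(g)
})

# Attach C and D to the df1 row that "owns" a df2 row.
# Assumption: match() keeps only the first df2 row per df1 row.
match_idx <- match(seq_len(nrow(df1)), owner)
result <- cbind(df1, df2[match_idx, c("C", "D")])
rownames(result) <- NULL
result
```

With this data, 2020-05-16 is 136 days after 2020-01-01 but only 104 days after 2020-02-02, so it goes to the third row of df1, matching the output above.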

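【Comments】:

Thanks for your answer! Suppose df2 now has two dates (e.g. 2020-04-05 and 2020-05-06) that relate to 2020-01-01 in df1. How can I create an additional row, so that there are two rows with "2020-01-01" and the corresponding joined values?
The first sql statement does that.
Sorry, I asked the wrong question. What I meant is whether a date in df2 can be joined to the closest df1 date (also within 12 months). For example, if df1 had an additional "2020-02-02", it would join to that date rather than "2020-01-01".
Two different rows in df1 can both be joined to the same row in df2 if both satisfy the join condition.
I see, but can a date in df2 be joined only to the closest date in df1? (Even if several dates in df1 satisfy the join condition?)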

【Solution 2】:

A dplyr solution, though G. Grothendieck's answer is top-notch!!

library(dplyr)
df1 %>% 
    # generate all possible combinations (possibly RAM expensive)
    dplyr::full_join(df2, by = character()) %>%
    # convert specific sample data columns to date
    dplyr::mutate(across(contains("date"), as.Date)) %>%
    # subset the data (you could use other functions here, but this is precise enough since we are looking at months)
    dplyr::filter(date.y > date.x & date.y < (date.x + 365.25)) %>%
    # rename and select columns
    dplyr::select(date = 1, 2, 3, 5, 6)

        date    A B C  D
1 2020-01-01    5 4 2  9
2 2018-01-01 <NA> 0 3 10

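Regarding your question: the additional task you asked for can be solved within this dplyr pipeline, although it is not the most elegant solution.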

# modified df2 so it does not match any of df1 in your criteria
df1 <- data.frame(date = c("2020-01-01", "2018-01-01"), A = c("5", NA), B = c("4", "0"))
df2 <- data.frame(date = c("2017-02-02"), C = c("2"), D = c("11"))

library(dplyr)
df1 %>% 
    # generate all possible combinations (possibly RAM expensive)
    dplyr::full_join(df2, by = character()) %>%
    # convert specific sample data columns to date
    dplyr::mutate(across(contains("date"), as.Date)) %>%
    # subset the data (you could use other functions here, but this is precise enough since we are looking at months)
    dplyr::filter(date.y > date.x & date.y < (date.x + 365.25)) %>%
    # rename and select columns
    dplyr::select(date = 1, 2, 3, 5, 6) %>%
    # bind the missing data back by selecting from within the pipe
    dplyr::union(df1[!df1$date %in% unique(.$date),] %>% 
                     # we have to simulate the missing columns and correct the date as dplyr::union() need the columns to match by name and type
                     dplyr::mutate(date = as.Date(date),
                                   # you have to get the right NA_xxxx_ for you columns (critical!)
                                   C = NA_character_,
                                   D = NA_character_))

            date    A B    C    D
    1 2020-01-01    5 4 <NA> <NA>
    2 2018-01-01 <NA> 0 <NA> <NA>
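
As a possible alternative to the cross join plus union, newer dplyr versions (1.1.0 and later) support inequality conditions directly through join_by(), so a left join can keep unmatched df1 rows by itself. The sketch below assumes such a version is installed. The upper column is a temporary helper introduced here, and the third df1 row is added purely to show an unmatched row.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where join_by() is available

df1 <- data.frame(date = as.Date(c("2020-01-01", "2018-01-01", "2010-06-01")),
                  A = c("5", NA, "1"), B = c("4", "0", "1"))
df2 <- data.frame(date = as.Date(c("2020-05-16", "2018-09-23", "2017-02-02")),
                  C = c("2", "3", "4"), D = c("9", "10", "11"))

result <- df1 %>%
  # temporary helper column: end of the 365-day window
  mutate(upper = date + 365) %>%
  # keep every df1 row; match df2 rows strictly after date and up to upper
  left_join(df2,
            by = join_by(x$date < y$date, x$upper >= y$date),
            suffix = c("", ".df2")) %>%
  select(date, A, B, C, D)

result
# C is "2", "3", NA: the illustrative 2010-06-01 row has no match and is kept.
```

The join_by() documentation also describes a closest() helper for keeping only the nearest match, which may be worth a look for the "closest date" follow-up question.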

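【Comments】:

Hi, thanks for the detailed answer! Is there a way to keep every row in df1, essentially doing a left join? Also, I was just wondering whether a date in df2 can be joined only to the closest date in df1? (Even if several dates in df1 satisfy the 'after and within 12 months' condition.)
There is a way, but although you stay within the dplyr pipeline it is not very pretty (see the edited answer).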