Add a column to a data frame according to values in multiple data frames in dplyr: answers

【Question title】: Add column to dataframe according to values in multiple dataframes in dplyr
【Posted】: 2020-02-29 13:32:50
【Question description】:

I have a data frame target with the columns SNP and value:

target <- data.frame("SNP" = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     "value" = 1:6)

I also have 3 other data frames, each with the columns SNP and int, stored in a list:

ref1 <- data.frame("SNP" = c("rs1", "rs2", "rs8"), "int" = c(5, 7, 88))
ref2 <- data.frame("SNP" = c("rs9", "rs4", "rs3"), "int" = c(23, 4, 43))
ref3 <- data.frame("SNP" = c("rs10", "rs6", "rs5"), "int" = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

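I want to add a new column int to target whose values are the int values from ref1/2/3 for the same SNP. For example, the first int value in target should be 7, because row 2 of ref1 has SNP rs2 and int 7.

I tried the following code: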

library(dplyr)

# Join each reference data frame onto target in turn
for (i in 1:3) {
    target <- target %>%
                left_join(mylist[[i]], by = "SNP")
}

The matching was fast and succeeded. However, I got 3 new int columns, one per reference data frame, instead of a single column.
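For reference, a minimal sketch that reproduces the extra columns on the toy data above. The name `joined` is introduced here only for illustration; `stringsAsFactors = FALSE` is added so the sketch behaves the same on R versions before 4.0.

```r
library(dplyr)

# Toy data from the question
target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6, stringsAsFactors = FALSE)
ref1 <- data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88),
                   stringsAsFactors = FALSE)
ref2 <- data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43),
                   stringsAsFactors = FALSE)
ref3 <- data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76),
                   stringsAsFactors = FALSE)
mylist <- list(ref1, ref2, ref3)

# `joined` is a hypothetical name; the question overwrites `target` instead
joined <- target
for (k in seq_along(mylist)) {
  joined <- left_join(joined, mylist[[k]], by = "SNP")
}

# The second join sees two columns called `int` and disambiguates them with
# the default suffixes .x and .y; the third join then adds a fresh `int`.
names(joined)
```

Each join only fills the rows matched by its own reference, so the three columns hold complementary values that still need to be merged into one.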

I then used the following code:

target[, "ref"] <- NA
for (i in 1:3) {
    common <- Reduce(intersect, list(target$SNP, mylist[[i]]$SNP))

    tar.pos <- match(common, target$SNP)
    ref.pos <- match(common, mylist[[i]]$SNP)

    target$ref[tar.pos] <- mylist[[i]]$int[ref.pos]
}

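In my real data I have 22 reference data frames, each with 1-6 million rows. I would rather match and join one ref at a time than combine all refs into one large data frame. When I tried the second approach above on my real data, I noticed that the match function ran very slowly, which is why I am looking for a smarter way. left_join, by contrast, ran very fast even on my large data. Unfortunately, its output is not what I want.

I would like to do this quickly, preferably in the tidyverse. Any suggestions on how to modify the first approach, or on any other smarter method?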

【Question discussion】:

Tags: r dplyr

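【Solution 1】:

If binding all the data in mylist and then merging it into target takes too much memory, you can use purrr::reduce to merge the references one at a time.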

library(tidyverse)

reduce(mylist,
       ~ left_join(.x, .y, by = "SNP") %>%
         mutate(int = coalesce(int.x, int.y)) %>%
         select(-c(int.x, int.y)),
       .init = mutate(target, int = NA_real_))

#    SNP value int
# 1  rs2     1   7
# 2  rs4     2   4
# 3  rs6     3  22
# 4 rs19     4  NA
# 5  rs8     5  88
# 6  rs9     6  23
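The same one-at-a-time idea can also be written as an explicit loop, which makes it possible to release each reference right after it has been used. This is only a sketch on the toy data from the question; `result` and `int_new` are names introduced for illustration, and whether the memory is actually returned depends on no other object still referring to the data.

```r
library(dplyr)

# Toy data from the question
target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6, stringsAsFactors = FALSE)
mylist <- list(
  data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88),
             stringsAsFactors = FALSE),
  data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43),
             stringsAsFactors = FALSE),
  data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76),
             stringsAsFactors = FALSE)
)

# Start with an empty numeric column that the loop fills in step by step
result <- mutate(target, int = NA_real_)

for (k in seq_along(mylist)) {
  # Rename the incoming column so the join never creates int.x / int.y
  ref <- rename(mylist[[k]], int_new = int)
  result <- result %>%
    left_join(ref, by = "SNP") %>%
    mutate(int = coalesce(int, int_new)) %>%  # keep values already found
    select(-int_new)
  # Drop the reference but keep the list length, so later indices stay valid
  mylist[k] <- list(NULL)
}

result
```

Note that `mylist[[k]] <- NULL` would remove the element and shift the remaining ones, which is why the sketch assigns `list(NULL)` instead.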

【Discussion】:

【Solution 2】:

With the tidyverse, we can also do this:

library(dplyr)
bind_rows(mylist) %>%
  right_join(target, by = "SNP")

【Discussion】:

【Solution 3】:

You can combine mylist into a single data frame and then merge it with target:

merge(target, do.call(rbind, mylist), by = "SNP", all.x = TRUE)

#   SNP value int
#1 rs19     4  NA
#2  rs2     1   7
#3  rs4     2   4
#4  rs6     3  22
#5  rs8     5  88
#6  rs9     6  23

Or using dplyr:

library(dplyr)
left_join(target, bind_rows(mylist), by = "SNP")

Or with data.table:

library(data.table)
rbindlist(mylist)[target, on = 'SNP']
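If stacking all references with rbindlist is too memory hungry, data.table's update join is a possible alternative: it assigns into target by reference, one reference table at a time. The sketch below is not part of the original answer; `target_dt` and `ref_dt` are illustrative names, and it assumes each SNP appears at most once across the references.

```r
library(data.table)

# Toy data from the question
target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6, stringsAsFactors = FALSE)
mylist <- list(
  data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88),
             stringsAsFactors = FALSE),
  data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43),
             stringsAsFactors = FALSE),
  data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76),
             stringsAsFactors = FALSE)
)

target_dt <- as.data.table(target)

for (k in seq_along(mylist)) {
  ref_dt <- as.data.table(mylist[[k]])
  # Update join: for rows of target_dt whose SNP occurs in ref_dt,
  # set int to the reference's int (i.int); other rows are left untouched
  target_dt[ref_dt, on = "SNP", int := i.int]
}

target_dt
```

The row order of target is preserved, and SNPs that occur in no reference (rs19 here) stay NA.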

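【Discussion】:

Thanks. In my real data I have 22 reference data sets, each with about 6 million rows. Binding the rows would certainly work, but combining all the refs for my real data would take too much memory. That is why I want to process each reference data set separately. Once the matching is done, I can delete each reference data set.
@Patrick Do you know beforehand which element of mylist will hold which data? Or do you have to go through them one by one to find the matches?
Thanks. My target has a column called chromosome, with values from 1 to 22. Within each ref data set, the chromosome value is fixed to a single value. That is why I have 22 reference data sets. Does this help in developing a faster method?
How does that relate to the data you shared? You could try the data.table version: rbindlist(mylist)[target, on = 'SNP']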
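
The chromosome hint in the comments suggests another possible route: split target by chromosome and join each piece only against its own reference. The following is a rough sketch under assumptions that are not in the shared data: the chromosome values assigned below are made up, `refs` is a hypothetical list named by chromosome, and every chromosome present in target is assumed to have a reference.

```r
library(dplyr)

# Toy data from the question, plus a made-up chromosome column (assumption)
target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6,
                     chromosome = c(1, 2, 3, 1, 1, 2),
                     stringsAsFactors = FALSE)

# Hypothetical: one reference per chromosome, named by chromosome number
refs <- list(
  `1` = data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88),
                   stringsAsFactors = FALSE),
  `2` = data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43),
                   stringsAsFactors = FALSE),
  `3` = data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76),
                   stringsAsFactors = FALSE)
)

# One piece of target per chromosome; names are "1", "2", "3"
pieces <- split(target, target$chromosome)

# Each piece is joined against a single reference, so only one int column appears
joined <- lapply(names(pieces), function(chr) {
  left_join(pieces[[chr]], refs[[chr]], by = "SNP")
})

result <- bind_rows(joined)
result
```

The rows come back grouped by chromosome rather than in the original order of target, so a final reordering step may be needed if the order matters.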