【Question title】: Merging multiple dataframes by 3 common columns in R
【Posted】: 2020-10-15 11:18:08
【Question】:

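I have 3 data frames that I am trying to merge/join. I have tried these two solutions: "Merge multiple data.frames in R with varying row length" and "Merge data.frames with duplicates". However, the output table is not what I want.

Here is sample code for my data frames: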

df1 <- data.frame(FzL = c(594.4014, 594.4147, 594.4148, 594.4194, 594.3877, 618.8600), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                    limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df2 <- data.frame(FzR = c(594.2836, 619.1613, 618.8364, 594.4196, 694.3853, 640.2640), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                    limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df3 <- data.frame(Frame = c(219388, 219389, 219390, 211387, 211388, 211389), Time = c("2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39",
       "2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39"),
       task = c("hop", "hop", "hop", "vj", "vj", "vj"), limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))

When I try to merge with this code:

 JOIN <- merge(df3, merge(df1, df2, by = c("task", "limb", "trial"), all = TRUE), by = c("task", "limb", "trial"), all = TRUE)

I get a table in which rows are repeated many times. I also tried this code:

run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$limb)))

out <- Reduce(function(...) merge(..., all = TRUE), L2)

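However, it only gave me the first 3 rows and did not run through the whole dataset.

My final table should have 7 columns: task, limb, trial, FzL, FzR, Frame, Time.

Any help would be greatly appreciated! Thank you.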

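【Comments】:

  • Where does the FzR column come from?
  • @Ronak Shah My mistake, data frame 2 has FzR
  • Do you need Reduce(merge, L)?
  • When I try: out
  • Why do you need L2? What is the expected output you are looking for?

Tags: r join merge duplicates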


【Solution 1】:

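When merging, the function has no way of knowing which FzL value corresponds to which FzR value, because the key columns (task, limb, trial) are not unique. It therefore creates every possible combination of rows within each key group.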
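To make the row multiplication concrete, here is a minimal base R sketch that reuses the key layout from the question (two key groups of three rows each). Only the key columns of df3 are reproduced, since the point is the row count rather than the values.

```r
# Key columns shared by all three frames: two groups of three rows each
task  <- rep(c("hop", "vj"), each = 3)
limb  <- rep(c("L", "R"), each = 3)
trial <- rep(c("trial1", "trial2"), each = 3)
keys  <- c("task", "limb", "trial")

df1 <- data.frame(FzL = c(594.4014, 594.4147, 594.4148, 594.4194, 594.3877, 618.8600),
                  task, limb, trial)
df2 <- data.frame(FzR = c(594.2836, 619.1613, 618.8364, 594.4196, 694.3853, 640.2640),
                  task, limb, trial)
df3 <- data.frame(Frame = c(219388, 219389, 219390, 211387, 211388, 211389),
                  task, limb, trial)

# Each key group has 3 rows per frame, so merge() pairs them 3 x 3 = 9 ways
pairwise <- merge(df1, df2, by = keys, all = TRUE)
nrow(pairwise)   # 18 (9 per group, 2 groups)

# Adding the third frame multiplies again: 9 x 3 = 27 per group
all_three <- merge(df3, pairwise, by = keys, all = TRUE)
nrow(all_three)  # 54
```

The duplication disappears only once the join key identifies a single row in each frame, which is what the approaches in this answer aim for.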

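If all three data frames are in exactly the same order (i.e. the first row of df1, with FzL of 594.4014, corresponds to the first row of df2, with FzR of 594.2836), you can instead bind the columns to join them together. Only do this if you are sure that each row corresponds to the same row in the other data frames.

In this case, column binding is probably what you are looking for, since in this example the number of rows and the identifiers are the same in every data frame.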

# Base R
df <- cbind(df1,
            subset(df2, select = c("FzR")),
            subset(df3, select = c("Frame", "Time")))

# Tidyverse
library(dplyr)
df <- df1 %>% 
  bind_cols(df2 %>% select(FzR)) %>% 
  bind_cols(df3 %>% select(Frame, Time))
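Because column binding matches rows purely by position, it may be worth checking that the key columns agree row by row before binding. The following is a minimal sketch of such a check; `same_keys` is a hypothetical helper written for this illustration, not part of base R or dplyr.

```r
keys <- c("task", "limb", "trial")

df1 <- data.frame(FzL = c(594.4014, 594.4147, 594.4148),
                  task = "hop", limb = "L", trial = "trial1")
df2 <- data.frame(FzR = c(594.2836, 619.1613, 618.8364),
                  task = "hop", limb = "L", trial = "trial1")

# Hypothetical helper: TRUE only if both frames have the same number of rows
# and identical key values in every row
same_keys <- function(a, b, keys) {
  nrow(a) == nrow(b) && isTRUE(all(a[keys] == b[keys]))
}

# Stop early instead of silently binding misaligned rows
stopifnot(same_keys(df1, df2, keys))
combined <- cbind(df1, df2["FzR"])
```

If the check fails (for example because one frame has fewer rows), binding by position is not safe and a keyed join is the better route.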

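Update after the comment that df3 has a different number of rows:

Another option is to still merge but, if all the data frames are in the same order, use the row number to indicate which row corresponds to which. This is a simpler route when one of the data frames has fewer rows.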

library(dplyr)

df1 <- df1 %>% 
  mutate(id = row_number())
df2 <- df2 %>% 
  mutate(id = row_number())
df3 <- df3 %>% 
  mutate(id = row_number())

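# Join on the shared keys plus the row id, stated explicitly
df <- df1 %>% 
  full_join(df2, by = c("task", "limb", "trial", "id")) %>% 
  full_join(df3, by = c("task", "limb", "trial", "id"))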
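A frame-wide row number only lines up if the shorter frame is missing rows at the very end. If rows are instead missing inside a key group, numbering the rows within each task/limb/trial group may be closer to what the asker describes. The sketch below uses base R only; `add_group_id` and `df3_short` are hypothetical names introduced for illustration, and `df3_short` is a made-up shortened version of df3, not the asker's real data.

```r
task  <- rep(c("hop", "vj"), each = 3)
limb  <- rep(c("L", "R"), each = 3)
trial <- rep(c("trial1", "trial2"), each = 3)

df1 <- data.frame(FzL = c(594.4014, 594.4147, 594.4148, 594.4194, 594.3877, 618.8600),
                  task, limb, trial)
df2 <- data.frame(FzR = c(594.2836, 619.1613, 618.8364, 594.4196, 694.3853, 640.2640),
                  task, limb, trial)
# Hypothetical shorter frame: 2 rows for the hop group, 1 row for the vj group
df3_short <- data.frame(Frame = c(219388, 219389, 211387),
                        Time  = "2020-06-05 13:26:39",
                        task  = c("hop", "hop", "vj"),
                        limb  = c("L", "L", "R"),
                        trial = c("trial1", "trial1", "trial2"))

# Hypothetical helper: number the rows within each task/limb/trial group
add_group_id <- function(x) {
  x$id <- ave(seq_len(nrow(x)), x$task, x$limb, x$trial, FUN = seq_along)
  x
}

frames <- lapply(list(df1, df2, df3_short), add_group_id)
by_cols <- c("task", "limb", "trial", "id")

# Full outer join on keys + within-group id; unmatched rows get NA
out <- Reduce(function(a, b) merge(a, b, by = by_cols, all = TRUE), frames)
```

This still assumes that, within each group, the rows of every frame are in corresponding order; rows with no partner in `df3_short` end up with NA in Frame and Time.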

【Comments】:

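  • In my actual dataset, the Frame and Time columns do not have as many rows as FzR and FzL, but each row does correspond to a row in the other data frames/columns. Will this solution still work? Also, could you explain the difference between the cbind and bind_cols solutions? I get the same output from each. Thanks!
  • Yes, I think it will still work, since the corresponding order is still correct. I will adjust the comments on the two solutions - they are just two ways of doing the same thing (the base R way and the tidyverse way)...
  • I get an error: "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 18400, 911", because the Frame and Time columns do not have as many rows in my real dataset. Is there a way to work around this? Thanks!
  • You would have to extend that data frame, matching the categories to df1 and df2 but using NA for Time and Frame. A bit hacky, but if the leading rows correspond it should work - will try to edit the answer
  • I would like NA filled into the Frame and Time columns until the rows in the vj task, R limb and trial2 trial columns match again
【Solution 2】: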

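Here is a slightly longer solution in which each value of the FzL and FzR variables corresponds to its given row number and no values are duplicated. It is done using the dplyr package.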

library(dplyr)
df1 <- data.frame(FzL = c(594.4014, 594.4147, 594.4148, 594.4194, 594.3877, 618.8600), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                  limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df2 <- data.frame(FzR = c(594.2836, 619.1613, 618.8364, 594.4196, 694.3853, 640.2640), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                  limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df3 <- data.frame(Frame = c(219388, 219389, 219390, 211387, 211388, 211389), Time = c("2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39",
                                                                                      "2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39"),
                  task = c("hop", "hop", "hop", "vj", "vj", "vj"), limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))

df4 <- df1 %>% 
    left_join(df2, by = c("FzL" = "FzR"))
df4 <- df4[,-c(5:7)]
df4 <- df4 %>% 
    mutate(FzR = df2[ ,1])

df5 <- df4 %>% 
    left_join(df3, by = c("FzL" = "Frame"))
df5 <- df5[,-c(6:9)]
df5 <- df5 %>% 
    mutate(Frame = df3[ ,c(1)],
           Time = df3[ ,c(2)])
df5 <- df5 %>% 
    rename(task = task.x, limb = limb.x, trial = trial.x) %>% 
    select(task, limb, trial, FzL, FzR, Frame, Time)
df5

The output is as follows:

  task limb  trial      FzL      FzR  Frame                Time
1  hop    L trial1 594.4014 594.2836 219388 2020-06-05 13:26:39
2  hop    L trial1 594.4147 619.1613 219389 2020-06-05 13:26:39
3  hop    L trial1 594.4148 618.8364 219390 2020-06-05 13:26:39
4   vj    R trial2 594.4194 594.4196 211387 2020-06-05 13:26:39
5   vj    R trial2 594.3877 694.3853 211388 2020-06-05 13:26:39
6   vj    R trial2 618.8600 640.2640 211389 2020-06-05 13:26:39
