[Question title]: R - reduce with merge and more than 2 suffixes (or: how to merge multiple dataframes and keep track of columns)
[Posted]: 2018-02-15 04:29:00
[Question]:

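I am trying to merge 4 data frames on 2 columns while keeping track of which data frame each column came from. Tracking the columns is where I run into trouble.

(See dput(dfs) at the end of the post)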

#df example (df1)
Name    Color    Freq
banana  yellow   3
apple   red      1
apple   green    4
plum    purple   8


#create list of dataframes
list.df <- list(df1, df2, df3, df4)

#merge dfs on column "Name" and "Color"
combo.df <- Reduce(function(x,y) merge(x,y, by = c("Name", "Color"), all = TRUE, accumulate=FALSE, suffixes = c(".df1", ".df2", ".df3", ".df4")), list.df)

This gives the following warning:

Warning message: In merge.data.frame(x, y, by = c("Name", "Color"), all = TRUE, : column names 'Freq.df1', 'Freq.df2' are duplicated in the result

and outputs this data frame:

#combo df example
Name    Color    Freq.df1   Freq.df2  Freq.df1  Freq.df2
banana  yellow   3          3         7         NA
apple   red      1          2         9         1
apple   green    4          NA        8         2
plum    purple   8          1         NA        6

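The .df1 and .df2 suffixes are repeated in the names only. The values in the third and fourth Freq columns of combo.df actually come from df3 and df4, respectively.

What I actually want is: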

Name    Color    Freq.df1   Freq.df2  Freq.df3  Freq.df4
banana  yellow   3          3         7         NA
apple   red      1          2         9         1
apple   green    4          NA        8         2
plum    purple   8          1         NA        6

How can I achieve this? I know that the suffixes argument of merge() only accepts a character vector of length 2, but I don't know how to work around that. Thanks!

df1 <- 
structure(list(Name = structure(c(2L, 1L, 1L, 3L), .Label = c("apple", 
"banana", "plum"), class = "factor"), Color = structure(c(4L, 
3L, 1L, 2L), .Label = c("green", "purple", "red", "yellow"), class = "factor"), 
    Freq = c(3, 1, 4, 8)), .Names = c("Name", "Color", "Freq"
), row.names = c(NA, -4L), class = "data.frame")

df2 <-
structure(list(Name = structure(c(2L, 1L, 3L), .Label = c("apple", 
"banana", "plum"), class = "factor"), Color = structure(c(3L, 
2L, 1L), .Label = c("purple", "red", "yellow"), class = "factor"), 
    Freq = c(3, 2, 1)), .Names = c("Name", "Color", "Freq"), row.names = c(NA, 
-3L), class = "data.frame")

df3 <-
structure(list(Name = structure(c(2L, 1L, 1L), .Label = c("apple", 
"banana"), class = "factor"), Color = structure(c(3L, 2L, 1L), .Label = c("green", 
"red", "yellow"), class = "factor"), Freq = c(7, 9, 8)), .Names = c("Name", 
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")

df4 <-
structure(list(Name = structure(c(1L, 1L, 2L), .Label = c("apple", 
"plum"), class = "factor"), Color = structure(c(3L, 1L, 2L), .Label = c("green", 
"purple", "red"), class = "factor"), Freq = c(1, 2, 6)), .Names = c("Name", 
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")

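[Question comments]:

  • Can you share all 4 data.frames using dput?
  • @TUSHAr - edited into the post
  • This is tricky. I'm not sure the source can be tracked elegantly while the merges are in progress. All we can do is pass the names of the data.frames as external values, in the same order in which we expect the merges to happen.

Tags: r dataframe merge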


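[Solution 1]:

A for loop seems easier here, because Reduce or reduce (purrr) handles only two datasets at a time, so we cannot pass more than two suffixes to merge.

Here we create a vector of suffixes ('sfx') and initialize the output dataset with the first list element. We then loop over the sequence of 'list.df' and, at each step, merge 'res' with the next element of list.df, updating 'res' as we go.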

sfx <- c(".df1", ".df2", ".df3", ".df4")
res <- list.df[[1]]
# Merge each remaining data frame into the running result 'res'
for (i in head(seq_along(list.df), -1)) {
  res <- merge(res, list.df[[i + 1]], all = TRUE,
               suffixes = sfx[i:(i + 1)], by = c("Name", "Color"))
}

res
#    Name  Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
#1  apple  green        4       NA        8        2
#2  apple    red        1        2        9        1
#3 banana yellow        3        3        7       NA
#4   plum purple        8        1       NA        6
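
A caveat worth noting, offered as a sketch rather than a correction: merge() applies suffixes only to non-key columns whose names collide. The loop above therefore produces Freq.df1 to Freq.df4 for these four frames, but with an odd number of frames the last column would keep the plain name Freq. One way to avoid depending on suffixes at all is to tag each value column with its frame's name before merging. The data below are made-up stand-ins for the frames in the question, and the names frames, keys, tagged and merged are illustrative only.

```r
# Hypothetical stand-in data (not the frames from the question)
frames <- list(
  df1 = data.frame(Name = c("apple", "plum"), Color = c("red", "purple"), Freq = c(1, 8)),
  df2 = data.frame(Name = c("apple", "banana"), Color = c("red", "yellow"), Freq = c(2, 3)),
  df3 = data.frame(Name = "apple", Color = "red", Freq = 9)
)
keys <- c("Name", "Color")

# Append the frame's name to every non-key column, e.g. Freq -> Freq.df1
tagged <- Map(function(d, nm) {
  is_val <- !(names(d) %in% keys)
  names(d)[is_val] <- paste(names(d)[is_val], nm, sep = ".")
  d
}, frames, names(frames))

# No column names collide any more, so merge() needs no suffixes
merged <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), tagged)

names(merged)
```

Because the renaming happens before any merge, the column names do not depend on the number of frames or on the order in which they are combined.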

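[Comments]:

    [Solution 2]:

    I finally got this working with the Reduce function itself. To do so, I restructured the input into a specific format.

    Since we cannot pass the name of a data.frame to the Reduce function as an argument, I created a list of lists in which the element n holds the name of each data.frame.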

    lst=list(list(n="df1",df=df1),list(n="df2",df=df2),list(n="df3",df=df3), list(n="df4",df=df4))
    

    Around this, I built logic to keep track of the names of the data.frames being processed.

    Reduce(function(x,y){
        if(ncol(x$df)==3){
          #df column names after 1st merge.
          namecol=c('Name','Color',paste0("Freq.",x$n),paste0("Freq.",y$n))
        }else{
            #df column names for remaining merges.
            namecol=c(colnames(x$df),paste0("Freq.",y$n))
        }
        df=merge.data.frame(x = x$df,y = y$df,by = c("Name","Color"),all = TRUE)
        colnames(df)=namecol
        list(n="df",df=df)},lst)
    
    
    #$n
    #[1] "df"
    
    #$df
    #    Name  Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
    #1  apple  green        4       NA        8        2
    #2  apple    red        1        2        9        1
    #3 banana yellow        3        3        7       NA
    #4   plum purple        8        1       NA        6
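
The same name-tracking idea can be sketched more compactly by letting Reduce walk over the names of a named list, so each step knows which frame it is merging. This is an assumption-laden illustration, not the answer's code: the data are made up, and frames, keys, tag_freq and by_name are hypothetical names.

```r
# Hypothetical stand-in data (not the frames from the question)
frames <- list(
  df1 = data.frame(Name = c("apple", "plum"), Color = c("red", "purple"), Freq = c(1, 8)),
  df2 = data.frame(Name = "apple", Color = "red", Freq = 2),
  df3 = data.frame(Name = c("apple", "plum"), Color = c("red", "purple"), Freq = c(9, 5))
)
keys <- c("Name", "Color")

# Illustrative helper: fetch a frame by name and rename Freq to Freq.<name>
tag_freq <- function(nm) {
  d <- frames[[nm]]
  names(d)[names(d) == "Freq"] <- paste0("Freq.", nm)
  d
}

# Reduce iterates over the remaining names; the accumulator starts
# as the tagged first frame, supplied through the init argument
by_name <- Reduce(
  function(acc, nm) merge(acc, tag_freq(nm), by = keys, all = TRUE),
  names(frames)[-1],
  tag_freq(names(frames)[1])
)

names(by_name)
```

Compared with checking ncol(x$df) == 3, this avoids special-casing the first merge, at the cost of requiring the list to be named.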
    

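    [Comments]:

      [Solution 3]:

      The function eat from my package safejoin has this feature: if you give it a named list of data.frames as the second input, it joins them recursively to the first input, prefixing the new columns with those names. The first data frame has to be renamed separately.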

      # devtools::install_github("moodymudskipper/safejoin")
      library(safejoin)
      library(dplyr)
      eat(rename(df1,df1_Freq = Freq), lst(df2,df3,df4),
          .by = c("Name","Color"), .mode= "full",.check="")
      #     Name  Color df1_Freq df2_Freq df3_Freq df4_Freq
      # 1 banana yellow        3        3        7       NA
      # 2  apple    red        1        2        9        1
      # 3  apple  green        4       NA        8        2
      # 4   plum purple        8        1       NA        6
      

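      .mode = "full" performs a full outer join, although in this case the default (a left join) would give the same result.

      .check = "" turns off the checks, which would otherwise warn that the factors in the join columns have different levels.

      [Comments]: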
