【Question title】:Joining multiple data frames based on column values using dplyr [duplicate]
【Posted】:2021-06-09 02:39:33
【Question】:

I have three similar data frames, as follows:

df1<-data.frame(Campaign_Name=c("Z019","Z005","Z019","Z005","Z019"),
            Sunday_endwk=c("20190106","20190113","20190113","20190106","20190106"),
            Actual_Sales=c(12,2,5,10,12.11),
            Predictions=c(11.9,2.03,5.1,10.5,11.7),
            Version=c("layer_1","layer_1","layer_1","layer_1","layer_1"),
            Adj_Rsquared=c(0.85,0.85,0.85,0.85,0.85))
df1

    Campaign_Name Sunday_endwk Actual_Sales Predictions Version Adj_Rsquared
1          Z019     20190106        12.00       11.90 layer_1         0.85
2          Z005     20190113         2.00        2.03 layer_1         0.85
3          Z019     20190113         5.00        5.10 layer_1         0.85
4          Z005     20190106        10.00       10.50 layer_1         0.85
5          Z019     20190106        12.11       11.70 layer_1         0.85

Similarly, the other two data frames are:

df2<-data.frame(Campaign_Name=c("Z019","Z019","Z005","Z005"),
                Sunday_endwk=c("20190106","20190113","20190106","20190113"),
                Actual_Sales=c(12.2,2.2,5.2,10.2),
                Predictions=c(11.8,2.05,5.4,10.1),
                Version=c("layer_2","layer_2","layer_2","layer_2"),
                Adj_Rsquared=c(0.88,0.88,0.88,0.88))  
#df2

df3<-data.frame(Campaign_Name=c("Z005","Z019","Z019","Z005","Z019"),
                Sunday_endwk=c("20190106","20190106","20190120","20190113","20190113"),
                Actual_Sales=c(12,2,5,10,12),
                Predictions=c(11.9,2.03,5.1,10.5,12.3),
                Version=c("layer_3","layer_3","layer_3","layer_3","layer_3"),
                Adj_Rsquared=c(0.82,0.82,0.82,0.82,0.82))
#df3

## expected output

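I am trying to merge all three data frames on the combination of Campaign_Name + Sunday_endwk (both must match across all three data frames) and reshape the result to wide format, as shown below: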

  Campaign_Name Sunday_endwk Actual_Sales_layer_1 Predictions_layer_1 Adj_Rsquared_layer_1 Actual_Sales_layer_2
1          Z019     20190106                   12               11.90                 0.85                 12.2
2          Z005     20190113                    2                2.03                 0.85                 10.2
3          Z019     20190113                    5                5.10                 0.85                  2.2
4          Z005     20190106                   10               10.50                 0.85                  5.2
  Predictions_layer_2 Adj_Rsquared_layer_2 Actual_Sales_layer_3 Predictions_layer_3 Adj_Rsquared_layer_3
1               11.80                 0.88                    2                2.03                 0.82
2               10.10                 0.88                   10               10.50                 0.82
3                2.05                 0.88                   12               12.30                 0.82
4                5.40                 0.88                   12               11.90                 0.82

If a Campaign_Name + Sunday_endwk combination is missing from any of the data frames, then that row can be either:

  1. omitted, or
  2. kept, with NA in the other columns
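
The two options above map onto an inner join and a full join respectively. A minimal sketch, assuming two small hypothetical frames `a` and `b` with unique keys (they are not the frames from the question, and their column names are chosen only for illustration):

```r
library(dplyr)

# Hypothetical two-row frames with unique keys, only to show the two behaviours
a <- data.frame(Campaign_Name        = c("Z019", "Z005"),
                Sunday_endwk         = c("20190106", "20190113"),
                Actual_Sales_layer_1 = c(12, 2))
b <- data.frame(Campaign_Name        = c("Z019", "Z019"),
                Sunday_endwk         = c("20190106", "20190120"),
                Actual_Sales_layer_3 = c(2, 5))

keys <- c("Campaign_Name", "Sunday_endwk")

# Option 1: drop any key that is missing from either frame
omitted <- inner_join(a, b, by = keys)

# Option 2: keep every key and fill the missing side with NA
kept <- full_join(a, b, by = keys)
```

Here `omitted` keeps only the key present in both frames, while `kept` retains all three distinct keys.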

Also, within a data frame the Campaign_Name + Sunday_endwk combination may not be unique.
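
A quick way to see which key combinations repeat is to count rows per key. The sketch below uses a trimmed copy of df1 from the question (only three of its columns); `dup_keys` is a name introduced here for illustration:

```r
library(dplyr)

# Trimmed copy of df1 from the question (key columns plus Actual_Sales)
df1 <- data.frame(
  Campaign_Name = c("Z019", "Z005", "Z019", "Z005", "Z019"),
  Sunday_endwk  = c("20190106", "20190113", "20190113", "20190106", "20190106"),
  Actual_Sales  = c(12, 2, 5, 10, 12.11)
)

# Count rows per key; any key with n > 1 is a repeated combination
dup_keys <- df1 %>%
  count(Campaign_Name, Sunday_endwk, name = "n") %>%
  filter(n > 1)

dup_keys
```

For this data the only repeated key is Z019 / 20190106, which occurs twice.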

Any help would be appreciated.

Thanks.

【Question comments】:

  • Reduce(function(x, y) merge(x, y, by = c('Campaign_Name', 'Sunday_endwk')), list(df1, df2, df3)) ?
  • This works, but it takes a long time to execute as the number of rows in the data frames grows.
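
The fold suggested in the comment can also be written with dplyr joins, renaming the value columns first so the result carries the `_layer_N` suffixes from the expected output. This is only a sketch: `suffix_by_version` is a hypothetical helper, the toy frames `d1` to `d3` have unique keys, and whether this runs faster than `merge` on large inputs is not established here and would need measuring.

```r
library(dplyr)
library(purrr)

keys <- c("Campaign_Name", "Sunday_endwk")

# Hypothetical helper (not from the thread): drop Version and append its
# value to every non-key column name, e.g. Actual_Sales -> Actual_Sales_layer_1
suffix_by_version <- function(df) {
  ver <- unique(df$Version)
  stopifnot(length(ver) == 1)  # each frame is assumed to hold a single Version
  df %>%
    select(-Version) %>%
    rename_with(~ paste(.x, ver, sep = "_"), .cols = -all_of(keys))
}

# Toy frames with unique keys (assumption; the question's df1 has a repeat)
d1 <- data.frame(Campaign_Name = c("Z019", "Z005"),
                 Sunday_endwk  = c("20190106", "20190113"),
                 Actual_Sales  = c(12, 2),
                 Version       = "layer_1")
d2 <- data.frame(Campaign_Name = c("Z019", "Z005"),
                 Sunday_endwk  = c("20190106", "20190113"),
                 Actual_Sales  = c(12.2, 10.2),
                 Version       = "layer_2")
d3 <- data.frame(Campaign_Name = c("Z019", "Z005"),
                 Sunday_endwk  = c("20190106", "20190113"),
                 Actual_Sales  = c(2, 10),
                 Version       = "layer_3")

# Rename each frame, then fold the list with successive inner joins on the keys
joined <- list(d1, d2, d3) %>%
  map(suffix_by_version) %>%
  reduce(inner_join, by = keys)
```

Swapping `inner_join` for `full_join` keeps keys that are missing from some frames and fills the gaps with NA.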

Tags: r join dplyr merge tidyr


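【Solution 1】:
library(tidyverse)

# Stack the three frames; .id = "week" records the source frame as "1", "2", "3"
bind_rows(df1, df2, df3, .id = "week") %>%
  rowid_to_column() %>%   # Added for non-unique combos of Campaign_Name/Sunday_endwk
  # Spread the value columns (Actual_Sales through Adj_Rsquared) by source frame
  pivot_wider(id_cols = c(Campaign_Name, Sunday_endwk, rowid),
              names_from = week, values_from = Actual_Sales:Adj_Rsquared)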

Result:

# A tibble: 5 x 14
  Campaign_Name Sunday_endwk Actual_Sales_1 Actual_Sales_2 Actual_Sales_3 Predictions_1 Predictions_2 Predictions_3 Version_1 Version_2 Version_3 Adj_Rsquared_1 Adj_Rsquared_2 Adj_Rsquared_3
  <chr>         <chr>                 <dbl>          <dbl>          <dbl>         <dbl>         <dbl>         <dbl> <chr>     <chr>     <chr>              <dbl>          <dbl>          <dbl>
1 Z019          20190106                 12           12.2              2         11.9          11.8           2.03 layer_1   layer_2   layer_3             0.85           0.88           0.82
2 Z005          20190113                  2           10.2             10          2.03         10.1          10.5  layer_1   layer_2   layer_3             0.85           0.88           0.82
3 Z019          20190113                  5            2.2             12          5.1           2.05         12.3  layer_1   layer_2   layer_3             0.85           0.88           0.82
4 Z005          20190106                 10            5.2             12         10.5           5.4          11.9  layer_1   layer_2   layer_3             0.85           0.88           0.82
5 Z019          20190120                 NA           NA                5         NA            NA             5.1  NA        NA        layer_3            NA             NA              0.82
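
One caveat with the rowid step: `rowid_to_column()` numbers every stacked row 1, 2, 3, ..., so once `rowid` is among the id columns no two rows share an id and values from different frames cannot line up on the same output row. A possible adjustment, which is not part of the original answer, is to number repeats within each source frame and key instead. The sketch below uses small hypothetical frames, and `occurrence` is an illustrative helper column; pivoting on `Version` also yields the `_layer_N` column names from the expected output.

```r
library(dplyr)
library(tidyr)

# Small hypothetical frames; df1 repeats the key Z019 / 20190106
df1 <- data.frame(Campaign_Name = c("Z019", "Z019", "Z005"),
                  Sunday_endwk  = c("20190106", "20190106", "20190113"),
                  Actual_Sales  = c(12, 12.11, 2),
                  Predictions   = c(11.9, 11.7, 2.03),
                  Version       = "layer_1")
df2 <- data.frame(Campaign_Name = c("Z019", "Z005"),
                  Sunday_endwk  = c("20190106", "20190113"),
                  Actual_Sales  = c(12.2, 10.2),
                  Predictions   = c(11.8, 10.1),
                  Version       = "layer_2")

wide <- bind_rows(df1, df2) %>%
  # Number repeated keys within each source frame: 1 for the first hit, 2 for the next
  group_by(Version, Campaign_Name, Sunday_endwk) %>%
  mutate(occurrence = row_number()) %>%
  ungroup() %>%
  # First occurrences from different frames now share one id and one output row
  pivot_wider(id_cols     = c(Campaign_Name, Sunday_endwk, occurrence),
              names_from  = Version,
              values_from = c(Actual_Sales, Predictions))
```

With this input `wide` has three rows: the second occurrence of Z019 / 20190106 exists only in `df1`, so its `layer_2` columns are NA rather than list-columns.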

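【Comments】:

  • I am not getting the desired result, because I end up with the values as list-cols. This is probably due to the non-unique Campaign_Name + Sunday_endwk combinations in df1.
  • Try adding rowid_to_column() as a step after bind_rows, and include rowid alongside Campaign_Name and Sunday_endwk as a column that is not pivoted. Does that solve it?
  • It would help if you could show that in the code.
  • Updated.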