【Question title】: for loop to merge two data frames with a common column in R
【Posted】: 2021-12-30 04:56:45
【Question】:

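I need to add some missing values to a dataset based on a list of codes. I want to do this by running a loop that merges each piece of the data with the list on their common column.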

This may be a duplicate of "Merge in loop R", or a special case of it.


#load packages and data
library(dplyr)  # provides %>%, arrange(), as_tibble() and full_join()
data("mtcars")
#add car names
mtcars <- cbind(cars = rownames(mtcars), mtcars)
rownames(mtcars) <- 1:nrow(mtcars)
#add dates and arrange
date <- rep(seq(as.Date("2015-01-02"), by = "month", length.out = 4), times = 8)
mtcars <- cbind(date = date, mtcars)
mtcars <- mtcars %>% 
  arrange(., date)
#add additional cars
add_cars <- c("renault", "dacia", "benz", "ferrari",
                "AC", "Acura", "Aixam", "Alfa",
                "Bertone", "Bestune", "Chevrolet",
                "Chrysler", "Haima", "Haval", "Hawtai", "Hennessey")
total_cars <- as_tibble(c(unique(mtcars$cars), add_cars))
colnames(total_cars) <-  "cars"
#split data on dates, list total cars
car_dates <- split(mtcars, f= mtcars$date)
total_cars <- as.list(total_cars)

#execute loop
results <- vector(mode = "integer", length = length(car_dates))
mylist <- list()

for (i in 1:length(car_dates)){
  g <- nrow(car_dates[[i]])
  results[i] <- g
  if (results[i] < 144){
    res <- list(merge(x = car_dates[[i]], y= total_cars,
                      by = c("cars"), all = T))
    mylist <- c(mylist, res)
    mydata_full <- as.data.frame(mylist)
  } 
}


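The loop yields a data frame with 48 obs. of 52 variables. Part of this is what I am after: the loop adds the missing observations to each date, but it spreads the dataset wide. The original 13 variables are now repeated once for each date.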

I'm stuck here. I want just the original 13 variables, with the data in long format instead of wide.


mydata_full <- as_tibble(mydata_full)
head(mydata_full)
# A tibble: 6 x 52
  cars     date         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cars.1  date.1     mpg.1 cyl.1
  <chr>    <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   <date>     <dbl> <dbl>
1 AC       NA            NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA AC      NA            NA    NA
2 Acura    NA            NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA Acura   NA            NA    NA
3 Aixam    NA            NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA Aixam   NA            NA    NA
4 Alfa     NA            NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA Alfa    NA            NA    NA
5 AMC Jav~ NA            NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA AMC Ja~ NA            NA    NA
6 benz     NA            NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA benz    NA            NA    NA
# ... with 35 more variables: disp.1 <dbl>, hp.1 <dbl>, drat.1 <dbl>, wt.1 <dbl>, qsec.1 <dbl>, vs.1 <dbl>,
#   am.1 <dbl>, gear.1 <dbl>, carb.1 <dbl>, cars.2 <chr>, date.2 <date>, mpg.2 <dbl>, cyl.2 <dbl>, disp.2 <dbl>,
#   hp.2 <dbl>, drat.2 <dbl>, wt.2 <dbl>, qsec.2 <dbl>, vs.2 <dbl>, am.2 <dbl>, gear.2 <dbl>, carb.2 <dbl>,
#   cars.3 <chr>, date.3 <date>, mpg.3 <dbl>, cyl.3 <dbl>, disp.3 <dbl>, hp.3 <dbl>, drat.3 <dbl>, wt.3 <dbl>,
#   qsec.3 <dbl>, vs.3 <dbl>, am.3 <dbl>, gear.3 <dbl>, carb.3 <dbl>


I'm sure this can be done more simply with full_join. I tried, but I only managed a full_join for each date separately. What am I missing?

#after rearranging the classes to tibble

mtcars_short <- mtcars %>%
  filter(date == "2015-02-02") %>%
  full_join(total_cars, by= c("cars"))

> print(mtcars_short)
# A tibble: 48 x 13
   date       cars                mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <date>     <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2015-02-02 Mazda RX4 Wag      21       6 160     110  3.9   2.88  17.0     0     1     4     4
 2 2015-02-02 Valiant            18.1     6 225     105  2.76  3.46  20.2     1     0     3     1
 3 2015-02-02 Merc 280           19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4
 4 2015-02-02 Merc 450SLC        15.2     8 276.    180  3.07  3.78  18       0     0     3     3
 5 2015-02-02 Fiat 128           32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 6 2015-02-02 Dodge Challenger   15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
 7 2015-02-02 Fiat X1-9          27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 8 2015-02-02 Ferrari Dino       19.7     6 145     175  3.62  2.77  15.5     0     1     5     6
 9 NA         Mazda RX4          NA      NA  NA      NA NA    NA     NA      NA    NA    NA    NA
10 NA         Hornet Sportabout  NA      NA  NA      NA NA    NA     NA      NA    NA    NA    NA

I want a df with 192 obs. of 13 variables, meaning that for each of the 4 unique dates I want all 48 observations.


# A tibble: 192 x 13
   cars    date         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
   <chr>   <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 
 1 AC      2015-01-02  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Acura   2015-01-02  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 
 3 Aixam   2015-01-02  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 
 4 Alfa    2015-01-02  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3 
 5 AMC Ja~ 2015-01-02  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4 
 6 benz    .            .    .    .    .    .    .    .    .    .    .    .     .     .
 7 Bertone .            .    .    .    .    .    .    .    .    .    .    .     .     .
 8 Bestune .            .    .    .    .    .    .    .    .    .    .    .     .     . 
 9 Cadill~ .            .    .    .    .    and so on    .    .    .     .     .  
10 Camaro~ .            .    .    .    .    .    .    .    .    .    .    .     .     .
.          date2
.          .
.          date3
.          etc.
.
192

Any input would be greatly appreciated!

【Comments】:

    Tags: r for-loop dplyr merge


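    【Solution 1】:

    A simple join solves this. Create a data frame with two columns: one holding every distinct car name, repeated once per unique date, and the other holding the distinct dates, each repeated once per distinct car.

    That data frame will look like this: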

               date              cars
      1: 2015-01-02         Mazda RX4
      2: 2015-01-02     Mazda RX4 Wag
      3: 2015-01-02        Datsun 710
      4: 2015-01-02    Hornet 4 Drive
      5: 2015-01-02 Hornet Sportabout
      ---                             
    188: 2015-04-02          Chrysler
    189: 2015-04-02             Haima
    190: 2015-04-02             Haval
    191: 2015-04-02            Hawtai
    192: 2015-04-02         Hennessey
    

    We can then left-join the mtcars data onto this table, using date and cars as the join keys.

    Below is the code I tried:

    data("mtcars")
    #add car names
    mtcars <- cbind(cars = rownames(mtcars), mtcars)
    rownames(mtcars) <- 1:nrow(mtcars)
    
    date <- rep(seq(as.Date("2015-01-02"), by = "month", length.out = 4),times = 8)
    mtcars <- cbind(date = date, mtcars)
    
    #add additional cars
    add_cars <- c("renault", "dacia", "benz", "ferrari",
              "AC", "Acura", "Aixam", "Alfa",
              "Bertone", "Bestune", "Chevrolet",
              "Chrysler", "Haima", "Haval", "Hawtai", "Hennessey")
    total_cars <- c(unique(mtcars$cars), add_cars)
    
    total_cars <- data.frame(date = rep(sort(unique(mtcars$date)), each = length(total_cars)), cars = rep(total_cars, length(unique(mtcars$date))))
    
    total_cars <- merge(total_cars, mtcars, by = c('date', 'cars'), all.x = TRUE)
    

    Sample output rows:

              date             cars  mpg cyl  disp  hp drat    wt qsec vs am gear carb
    183 2015-04-02       Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.4  0  0    3    3
    184 2015-04-02       Merc 450SL   NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA
    185 2015-04-02      Merc 450SLC   NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA
    186 2015-04-02 Pontiac Firebird   NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA
    187 2015-04-02    Porsche 914-2   NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA
    188 2015-04-02          renault   NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA
    189 2015-04-02   Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.9  1  1    4    1
    190 2015-04-02    Toyota Corona   NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA
    191 2015-04-02          Valiant   NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA
    192 2015-04-02       Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
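
The date-by-car table described in this answer can also be built with `expand.grid()` instead of two `rep()` calls. The following is only a sketch of that variant, not the answerer's code: the names `cars_df`, `grid` and `full` are made up here, and the inputs are rebuilt from the question so the block runs on its own with base R.

```r
# Rebuild the question's inputs (base R only)
data("mtcars")
cars_df <- cbind(cars = rownames(mtcars), mtcars)
rownames(cars_df) <- NULL
cars_df <- cbind(
  date = rep(seq(as.Date("2015-01-02"), by = "month", length.out = 4), times = 8),
  cars_df
)
add_cars <- c("renault", "dacia", "benz", "ferrari",
              "AC", "Acura", "Aixam", "Alfa",
              "Bertone", "Bestune", "Chevrolet",
              "Chrysler", "Haima", "Haval", "Hawtai", "Hennessey")

# Every combination of car name (32 original + 16 added) and date (4)
grid <- expand.grid(
  cars = c(unique(cars_df$cars), add_cars),
  date = sort(unique(cars_df$date)),
  stringsAsFactors = FALSE
)

# Left join: keep every grid row, fill measurements where a match exists
full <- merge(grid, cars_df, by = c("date", "cars"), all.x = TRUE)
dim(full)  # 192 rows, 13 columns
```

Only the 32 date and car pairs that exist in the original data carry measurements; every other row is `NA` in the measurement columns, as in the sample output above.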
    

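    【Comments】:

      【Solution 2】:

      After a few hours of digging I finally found a solution, hooray!

      I found it in the question "Convert a list to a data frame". Thanks to the comment by @mflo-ByeSE, I found the solution here: https://www.r-bloggers.com/2014/06/concatenating-a-list-of-data-frames/

      I modified the loop so that each list element is named after its date, by adding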

      names(res) <- names(car_dates[i])
      

      inside the loop.

      I also kept the output as a list by dropping this line from the loop:

      mydata_full <- as.data.frame(mylist)
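
A toy sketch of why that line spread the data wide. The data frames `a` and `b` are made up for illustration and are not from the question: calling `as.data.frame()` on a list of data frames places them side by side, while `do.call(rbind, ...)` stacks them.

```r
# Two small data frames with identical columns (hypothetical example data)
a <- data.frame(id = 1:2, v = c(10, 20))
b <- data.frame(id = 1:2, v = c(30, 40))
pieces <- list(a, b)

# Side by side: the columns of b are appended next to those of a
wide <- as.data.frame(pieces)
dim(wide)  # 2 rows, 4 columns

# Stacked: the rows of b are appended below those of a
long <- do.call(rbind, pieces)
dim(long)  # 4 rows, 2 columns
```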
      

      The improved loop and the solution are below:

      
      #loop
      results <- vector(mode = "integer", length = length(car_dates))
      mylist <- list()
      
      for (i in 1:length(car_dates)){
        g <- nrow(car_dates[[i]])
        results[i] <- g
        if (results[i] < 144){
          res <- list(merge(x = car_dates[[i]], y= total_cars,
                            by = c("cars"), all = T))
          names(res) <- names(car_dates[i])
          mylist <- c(mylist, res)
        } 
      }
      
      #then
      mydata_full <- as_tibble(plyr::ldply(mylist, rbind))
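
For readers without plyr, the same split, merge and stack idea can be sketched with `lapply()` and `do.call(rbind, ...)` from base R. This swaps in base functions for `plyr::ldply()` and is not the code from this answer; `cars_df`, `all_cars`, `pieces` and `stacked` are names made up for the sketch. It also fills in the date for the rows added by the merge, which is an assumption about what is wanted.

```r
# Rebuild the question's inputs (base R only)
data("mtcars")
cars_df <- cbind(cars = rownames(mtcars), mtcars)
rownames(cars_df) <- NULL
cars_df$date <- rep(seq(as.Date("2015-01-02"), by = "month", length.out = 4),
                    times = 8)
add_cars <- c("renault", "dacia", "benz", "ferrari",
              "AC", "Acura", "Aixam", "Alfa",
              "Bertone", "Bestune", "Chevrolet",
              "Chrysler", "Haima", "Haval", "Hawtai", "Hennessey")
all_cars <- data.frame(cars = c(unique(cars_df$cars), add_cars))

# One full merge per date, keeping every car name
pieces <- lapply(split(cars_df, cars_df$date), function(piece) {
  out <- merge(piece, all_cars, by = "cars", all = TRUE)
  out$date <- piece$date[1]  # assumption: added rows should carry the date too
  out
})

# Stack the per-date pieces into one long data frame
stacked <- do.call(rbind, pieces)
rownames(stacked) <- NULL
dim(stacked)  # 192 rows, 13 columns
```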
      
      
      

      Cheers

      【Comments】:
