Difference between rbind() and bind_rows() in R: answers

[Question title]: Difference between rbind() and bind_rows() in R
[Posted]: 2017-08-10 18:20:15
[Question description]:

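On the web I found that rbind() is used to combine two data frames, and that the bind_rows() function performs the same task.

So I don't understand what the difference between these two functions is, and which one is more efficient to use.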

[Question comments]:

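bind_rows has extra arguments such as .id. bind_rows can also bind several data frames held in a list, whereas rbind takes the data frames as separate arguments, so binding a list requires do.call. As for efficiency, it is easy to check with system.time or microbenchmark on a large dataset when you have time.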
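As a rough illustration of the comment above, here is a minimal base R sketch that binds a list of data frames with do.call and times the call with system.time. The data frame contents and the list length of 50 are arbitrary assumptions chosen only for the example; timings will vary by machine.

```r
# Build a list of 50 small, identical data frames (illustrative data only).
dfs <- rep(list(data.frame(x = 1:3, y = letters[1:3], stringsAsFactors = FALSE)), 50)

# rbind() takes data frames as separate arguments, so a list is passed via do.call().
timing <- system.time(combined <- do.call(rbind, dfs))

dim(combined)          # number of rows and columns of the combined data frame
timing[["elapsed"]]    # wall-clock seconds; depends on the machine
```

For more reliable comparisons, repeated measurements (for example with the microbenchmark package mentioned above) are usually preferable to a single system.time call.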

Tags: r rbind

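[Solution 1]:

Among other differences, one of the main reasons to use bind_rows over rbind is to combine two data frames that have different numbers of columns. In that case rbind throws an error, whereas bind_rows fills the columns missing from one data frame with NA in the rows that come from it.

Try the following code to see the difference: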

a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

The results of the two calls are as follows:

> rbind(a, b)
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

> library(dplyr)
> bind_rows(a, b)
  a b c  d
1 1 3 5 NA
2 2 4 6 NA
3 7 2 3  8
4 8 3 4  9
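
If dplyr is not available, a similar result can be obtained in base R by padding each data frame with NA columns before calling rbind. The sketch below is only one possible workaround; `pad_columns` is a hypothetical helper written for this example, not a function from base R or dplyr.

```r
# Hypothetical helper: add any columns from `cols` that `df` lacks, filled with NA,
# and return the columns in the order given by `cols`.
pad_columns <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- rep(NA, nrow(df))
  }
  df[cols]
}

a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

# Pad both inputs to the union of their column names, then bind as usual.
all_cols <- union(names(a), names(b))
combined <- rbind(pad_columns(a, all_cols), pad_columns(b, all_cols))
combined
```

Rows that come from `a` get NA in column `d`, which mirrors what bind_rows does in the output above.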

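[Discussion]:

[Solution 2]:

Since none of the answers here offers a systematic review of the differences between base::rbind and dplyr::bind_rows, and the answer from @bob regarding performance is incorrect, I decided to add the following.

Let's set up a test data frame: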

df_1 = data.frame(
  v1_dbl = 1:1000,
  v2_lst = I(as.list(1:1000)),
  v3_fct = factor(sample(letters[1:10], 1000, replace = TRUE)),
  v4_raw = raw(1000),
  v5_dtm = as.POSIXct(paste0("2019-12-0", sample(1:9, 1000, replace = TRUE)))
)

df_1$v2_lst = unclass(df_1$v2_lst) #remove the AsIs class introduced by `I()`

1. `base::rbind` handles list input differently

rbind(list(df_1, df_1))
     [,1]   [,2]  
[1,] List,5 List,5

# You have to combine it with `do.call()` to achieve the same result:
head(do.call(rbind, list(df_1, df_1)), 3)
  v1_dbl v2_lst v3_fct v4_raw     v5_dtm
1      1      1      b     00 2019-12-02
2      2      2      h     00 2019-12-08
3      3      3      c     00 2019-12-09

head(dplyr::bind_rows(list(df_1, df_1)), 3)
  v1_dbl v2_lst v3_fct v4_raw     v5_dtm
1      1      1      b     00 2019-12-02
2      2      2      h     00 2019-12-08
3      3      3      c     00 2019-12-09

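2. `base::rbind` can cope with (some) mixed types

While both base::rbind and dplyr::bind_rows fail when trying to bind, for example, a raw or datetime column to a column of some other type, base::rbind can cope with some degree of discrepancy.

Combining a list column and a non-list column produces a list column. Combining a factor with something else produces a warning but not an error: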

df_2 = data.frame(
  v1_dbl = 1,
  v2_lst = 1,
  v3_fct = 1,
  v4_raw = raw(1),
  v5_dtm = as.POSIXct("2019-12-01")
)

head(rbind(df_1, df_2), 3)
  v1_dbl v2_lst v3_fct v4_raw     v5_dtm
1      1      1      b     00 2019-12-02
2      2      2      h     00 2019-12-08
3      3      3      c     00 2019-12-09
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 1) : invalid factor level, NA generated

# Fails on the lst, num combination:
head(dplyr::bind_rows(df_1, df_2), 3)
Error: Column `v2_lst` can't be converted from list to numeric

# Fails on the fct, num combination:
head(dplyr::bind_rows(df_1[-2], df_2), 3)
Error: Column `v3_fct` can't be converted from factor to numeric

3. `base::rbind` keeps row names

The tidyverse advocates moving row names into a dedicated column, so its functions drop them.

rbind(mtcars[1:2, 1:4], mtcars[3:4, 1:4])
                mpg cyl disp  hp
Mazda RX4      21.0   6  160 110
Mazda RX4 Wag  21.0   6  160 110
Datsun 710     22.8   4  108  93
Hornet 4 Drive 21.4   6  258 110

dplyr::bind_rows(mtcars[1:2, 1:4], mtcars[3:4, 1:4])
   mpg cyl disp  hp
1 21.0   6  160 110
2 21.0   6  160 110
3 22.8   4  108  93
4 21.4   6  258 110
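
To keep the information carried by the row names, they can be copied into an ordinary column before binding. The following base R sketch shows one way to do that; `rownames_to_col` and the column name "model" are hypothetical names chosen for this example.

```r
# Hypothetical helper: copy row names into a regular column named `var`
# and reset the row names to the default 1..n sequence.
rownames_to_col <- function(df, var = "model") {
  first_col <- setNames(data.frame(rownames(df), stringsAsFactors = FALSE), var)
  out <- cbind(first_col, df)
  rownames(out) <- NULL
  out
}

parts <- list(mtcars[1:2, 1:4], mtcars[3:4, 1:4])

# Convert each piece first, then bind; the car names survive as a column.
combined <- do.call(rbind, lapply(parts, rownames_to_col))
combined
```

Data frames prepared this way could equally be passed to dplyr::bind_rows, since the names now live in a normal column rather than in the row names.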

4. `base::rbind` cannot handle missing columns

Just for completeness, since Abhilash Kandwal already covered this in their answer.

5. `base::rbind` handles named arguments differently

base::rbind prepends the argument names to the row names, whereas dplyr::bind_rows can optionally add a dedicated ID column:

rbind(hi = mtcars[1:2, 1:4], bye = mtcars[3:4, 1:4])
                    mpg cyl disp  hp
hi.Mazda RX4       21.0   6  160 110
hi.Mazda RX4 Wag   21.0   6  160 110
bye.Datsun 710     22.8   4  108  93
bye.Hornet 4 Drive 21.4   6  258 110

dplyr::bind_rows(hi = mtcars[1:2, 1:4], bye = mtcars[3:4, 1:4], .id = "my_id")
  my_id  mpg cyl disp  hp
1    hi 21.0   6  160 110
2    hi 21.0   6  160 110
3   bye 22.8   4  108  93
4   bye 21.4   6  258 110

6. `base::rbind` turns vector arguments into rows (and recycles them)

By contrast, dplyr::bind_rows adds columns (and therefore requires the elements of x to be named):

rbind(mtcars[1:2, 1:4], x = 1:2)
              mpg cyl disp  hp
Mazda RX4      21   6  160 110
Mazda RX4 Wag  21   6  160 110
x               1   2    1   2

dplyr::bind_rows(mtcars[1:2, 1:4], x = c(a = 1, b = 2))
  mpg cyl disp  hp  a  b
1  21   6  160 110 NA NA
2  21   6  160 110 NA NA
3  NA  NA   NA  NA  1  2

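7. `base::rbind` is slower and needs more RAM

To bind one hundred medium-sized data frames (1k rows each), base::rbind allocates roughly 40 times more RAM and is more than 10 times slower in the benchmark below: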

dfs = rep(list(df_1), 100)
bench::mark(
  "base::rbind" = do.call(rbind, dfs),
  "dplyr::bind_rows" = dplyr::bind_rows(dfs)
)[, 1:5]

# A tibble: 2 x 5
  expression            min   median `itr/sec` mem_alloc
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 base::rbind       47.23ms  48.05ms      20.0  104.48MB
2 dplyr::bind_rows   3.69ms   3.75ms     261.     2.39MB

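Since I needed to bind many small data frames, here is a benchmark for that case too. The difference in speed, and especially in RAM, is striking: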

dfs = rep(list(df_1[1:2, ]), 10^4)
bench::mark(
  "base::rbind" = do.call(rbind, dfs),
  "dplyr::bind_rows" = dplyr::bind_rows(dfs)
)[, 1:5]

# A tibble: 2 x 5
  expression            min   median `itr/sec` mem_alloc
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 base::rbind         1.65s    1.65s     0.605    1.56GB
2 dplyr::bind_rows  19.31ms  20.21ms    43.7    566.69KB

Finally, help("rbind") and help("bind_rows") are interesting reads as well.

[Discussion]:

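I would suggest adding that bind_rows() can combine data frames that contain the same column names in a different order. For example, bind_rows() can merge and append DF1: [Column1, Column2, Column3] and DF2: [Column3, Column1, Column2], whereas rbind() requires them to have the same column order.
Wait, that is not right. You can perfectly well do, for example, rbind(mtcars[, 1:3], mtcars[, 3:1]) and it works fine even though the columns are in a different order.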
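
The point made in the reply can be checked with a small base R sketch. The data frames `df1` and `df2` are made up for this example; the only assumption is that both have the same column names in a different order.

```r
# Two data frames with the same column names in a different order.
df1 <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(y = c("c", "d"), x = 3:4, stringsAsFactors = FALSE)

# rbind() on data frames matches columns by name, not by position;
# the result takes its column order from the first argument.
combined <- rbind(df1, df2)
combined
```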

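[Solution 3]:

Although bind_rows() is more practical in the sense that it can combine data frames with different numbers of columns (assigning NA to the rows that lack those columns), I would recommend rbind() if the data frames you are combining have the same columns.

rbind() is more computationally efficient when the data frames you combine have the same format; it simply throws an error when the number of columns differs. It will save you a lot of time on big datasets. For those cases I strongly recommend rbind(). That said, if your data have different columns, you must use bind_rows().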

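[Discussion]:

I just ran multiple tests on a very large dataset (about 917 million rows, 13 columns) and can confirm that bind_rows is on average at least 3-4 times faster than rbind. The same holds for bind_cols and cbind.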