dplyr >= 1.0.0
在较新版本的dplyr 中,您可以使用rowwise() 和c_across 对没有特定逐行变体但如果存在逐行变体的函数执行逐行聚合它应该比使用更快 rowwise(例如rowSums,rowMeans)。
由于rowwise() 只是一种特殊的分组形式,它改变了动词的工作方式,因此您可能希望在进行逐行操作后将其通过管道传递给ungroup()。
按名称选择范围:
df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
按类型选择:
df %>%
rowwise() %>%
mutate(sumnumeric = sum(c_across(where(is.numeric)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
按列名选择:
您可以使用任意数量的tidy selection helpers,例如starts_with、ends_with、contains等。
df %>%
rowwise() %>%
mutate(sum_startswithx = sum(c_across(starts_with("x")), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
按列索引选择:
df %>%
rowwise() %>%
mutate(sumindex = sum(c_across(c(1:4, 5)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
rowise() 适用于任何摘要函数。但是,在您的特定情况下,存在逐行变体 (rowSums),因此您可以执行以下操作(请注意改用 across),这样会更快:
df %>%
mutate(sumrow = rowSums(across(x1:x5), na.rm = T))
有关更多信息,请参阅rowwise 上的页面。
基准测试
rowwise 使管道链非常可读,并且适用于较小的数据帧。但是,它的效率很低。
rowwise 与逐行变体函数
对于这个例子,逐行变体rowSums 要快得多:
library(microbenchmark)
set.seed(1)
large_df <- slice_sample(df, n = 1E5, replace = T) # 100,000 obs
microbenchmark(
large_df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T)),
large_df %>%
mutate(sumrow = rowSums(across(x1:x5), na.rm = T)),
times = 10L
)
Unit: milliseconds
min lq mean median uq max neval cld
11108.459801 11464.276501 12144.871171 12295.362251 12690.913301 12918.106801 10 b
6.533301 6.649901 7.633951 7.808201 8.296101 8.693101 10 a
没有逐行变体函数的大型数据框
如果您的函数没有逐行变体并且您的数据框很大,请考虑使用长格式,它比rowwise 更有效。虽然可能有更快的非 tidyverse 选项,但这里有一个 tidyverse 选项(使用tidyr::pivot_longer):
library(tidyr)
tidyr_pivot <- function(){
large_df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = starts_with("x")) %>%
group_by(rn) %>%
summarize(std = sd(value, na.rm = T), .groups = "drop") %>%
bind_cols(large_df, .) %>%
select(-rn)
}
dplyr_rowwise <- function(){
large_df %>%
rowwise() %>%
mutate(std = sd(c_across(starts_with("x")), na.rm = T)) %>%
ungroup()
}
microbenchmark(dplyr_rowwise(),
tidyr_pivot(),
times = 10L)
Unit: seconds
expr min lq mean median uq max neval cld
dplyr_rowwise() 12.845572 13.48340 14.182836 14.30476 15.155155 15.409750 10 b
tidyr_pivot() 1.404393 1.56015 1.652546 1.62367 1.757428 1.981293 10 a
c_across 与cross
在 sum 函数的特定情况下,across 和 c_across 对上述大部分代码给出相同的输出:
sum_across <- df %>%
rowwise() %>%
mutate(sumrange = sum(across(x1:x5), na.rm = T))
sum_c_across <- df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T)
all.equal(sum_across, sum_c_across)
[1] TRUE
c_across 的逐行输出是一个向量(因此是 c_),而 across 的逐行输出是一个 1 行的 tibble 对象:
df %>%
rowwise() %>%
mutate(c_across = list(c_across(x1:x5)),
across = list(across(x1:x5)),
.keep = "unused") %>%
ungroup()
# A tibble: 10 x 2
c_across across
<list> <list>
1 <dbl [5]> <tibble [1 x 5]>
2 <dbl [5]> <tibble [1 x 5]>
3 <dbl [5]> <tibble [1 x 5]>
4 <dbl [5]> <tibble [1 x 5]>
5 <dbl [5]> <tibble [1 x 5]>
6 <dbl [5]> <tibble [1 x 5]>
7 <dbl [5]> <tibble [1 x 5]>
8 <dbl [5]> <tibble [1 x 5]>
9 <dbl [5]> <tibble [1 x 5]>
10 <dbl [5]> <tibble [1 x 5]>
您要应用的功能将需要您使用哪个动词。如上面sum 所示,您几乎可以互换使用它们。但是,mean 和许多其他常用函数都需要一个(数字)向量作为其第一个参数:
class(df[1,])
"data.frame"
sum(df[1,]) # works with data.frame
[1] 4
mean(df[1,]) # does not work with data.frame
[1] NA
Warning message:
In mean.default(df[1, ]) : argument is not numeric or logical: returning NA
class(unname(unlist(df[1,])))
"numeric"
sum(unname(unlist(df[1,]))) # works with numeric vector
[1] 4
mean(unname(unlist(df[1,]))) # works with numeric vector
[1] 0.8
忽略均值 (rowMean) 存在的逐行变量,那么在这种情况下应使用 c_across:
df %>%
rowwise() %>%
mutate(avg = mean(c_across(x1:x5), na.rm = T)) %>%
ungroup()
# A tibble: 10 x 6
x1 x2 x3 x4 x5 avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 1 1 0.8
2 0 1 1 0 1 0.6
3 0 NA 0 NA NA 0
4 NA 1 1 1 1 1
5 0 1 1 0 1 0.6
6 1 0 0 0 1 0.4
7 1 NA NA NA NA 1
8 NA NA NA 0 1 0.5
9 0 0 0 0 0 0
10 1 1 1 1 1 1
# Does not work
df %>%
rowwise() %>%
mutate(avg = mean(across(x1:x5), na.rm = T)) %>%
ungroup()
rowSums、rowMeans 等可以将数字数据框作为第一个参数,这就是它们使用 across 的原因。