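[Posted]: 2018-07-06 17:30:16
[Question]:
I want to combine a set of data frames into a single data frame by summing the columns with matching variables (instead of appending columns).
For example, given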
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match on "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is
library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
m1 <- rowwise(m1) %>%
mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
select(A, B, x)
return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
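Is there a more efficient way to combine a large number of data frames? Ideally it should be recursive (i.e., preferably without joining all the data frames into one huge data frame before computing the sums).
[Comments]:
- If you are saying that merge or full_join is more memory-efficient, that is fine, but I think rowwise followed by sum is inefficient. I would use rowSums, or reduce with +.
- Great, thanks! So I can replace the second line in my_merge with res <- res %>% mutate(x = rowSums(select(., x.x, x.y), na.rm = TRUE)) %>% select(A, B, x) (per stackoverflow.com/questions/27354734/…).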
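The suggestion in the comments can be sketched in base R roughly as follows. This is only an illustration, not the asker's final code: the helper name my_merge2 is made up here, rowSums replaces rowwise() plus sum(), and Reduce replaces the explicit for loop. It assumes every data frame has exactly the columns A, B and x, as in the example data.

```r
# Example data from the question
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

# Hypothetical helper: merge two data frames on A and B and sum their x columns
my_merge2 <- function(d1, d2) {
  # full outer join; the two x columns come back as x.x and x.y
  m <- merge(d1, d2, by = c("A", "B"), all = TRUE)
  # vectorised row sums, treating a missing x as 0
  m$x <- rowSums(m[, c("x.x", "x.y")], na.rm = TRUE)
  m[, c("A", "B", "x")]
}

# Fold the list pairwise: only the running total and the next data frame are merged at each step
res <- Reduce(my_merge2, list(df1, df2, df3))
res <- res[order(res$A, res$B), ]
rownames(res) <- NULL
print(res)
```

Because Reduce carries forward only the accumulated (A, B, x) table, the intermediate result never holds more than two x columns at once, which is close to the "recursive" behaviour asked for in the question.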