Edit note: I removed the original part of my answer that did not handle NAs, and added a benchmark.
concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
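To make the NA handling concrete, here is a small sketch of what concat2 returns for a group with some missing values and for a group that is entirely missing. The input vectors are made up for illustration.

```r
concat2 <- function(x) if (all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")

# Mixed group: the NA is dropped and the remaining values are joined
mixed <- concat2(c("zz", NA, "cd"))
print(mixed)

# All-NA group: a character NA is returned instead of an empty string
all_missing <- concat2(c(NA, NA))
print(all_missing)
```

Returning NA_character_ rather than "" keeps the result column of type character while still marking the group as missing.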
Using data.table:
setDT(df)[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
# proid X1 X2 X3
#1: 1 zz,cd a,s e,f
#2: 2 ff,ta g,b z,h
#3: 3 NA t e
Using dplyr:
df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
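In recent dplyr releases, summarise_each() and funs() are deprecated. A possible equivalent, assuming dplyr 1.0.0 or later, uses across(). The data frame below is hypothetical, made up only so that it reproduces the output shown above; the real df is not part of this answer.

```r
library(dplyr)

concat2 <- function(x) if (all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")

# Hypothetical input chosen to match the printed result above
df <- data.frame(
  proid = c(1, 1, 2, 2, 3, 3),
  X1 = c("zz", "cd", "ff", "ta", NA, NA),
  X2 = c("a", "s", "g", "b", "t", NA),
  X3 = c("e", "f", "z", "h", "e", NA),
  X4 = c("p", "q", "r", "s", "t", "u"),
  stringsAsFactors = FALSE
)

# across() skips the grouping column, so -X4 selects X1, X2 and X3
res <- df %>%
  group_by(proid) %>%
  summarise(across(-X4, concat2))

print(res)
```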
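Benchmark: the data are smaller than in the actual use case and not fully representative, so this is only meant to give a rough impression of how concat2 compares with concat, and data.table with dplyr.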
library(microbenchmark)
library(dplyr)
library(data.table)
N <- 1e6
x <- c(letters, LETTERS)
df <- data.frame(
  proid = sample(1e4, N, TRUE),
  X1 = sample(sample(c(x, NA), N, TRUE)),
  X2 = sample(sample(c(x, NA), N, TRUE)),
  X3 = sample(sample(c(x, NA), N, TRUE)),
  X4 = sample(sample(c(x, NA), N, TRUE))
)
dt <- as.data.table(df)
concat <- function(x) {
  x <- na.omit(x)
  if (length(x) == 0) {
    return(as.character(NA))
  } else {
    return(paste(x, collapse = ","))
  }
}
concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
concat.dplyr <- function() {
  df %>% group_by(proid) %>% summarise_each(funs(concat), -X4)
}
concat2.dplyr <- function() {
  df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
}
concat.data.table <- function() {
  dt[, lapply(.SD, concat), by = proid, .SDcols = -c("X4")]
}
concat2.data.table <- function() {
  dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
}
microbenchmark(concat.dplyr(),
               concat2.dplyr(),
               concat.data.table(),
               concat2.data.table(),
               unit = "relative",
               times = 10L)
Unit: relative
                 expr      min       lq   median       uq      max neval
       concat.dplyr() 1.058839 1.058342 1.083728 1.105907 1.080883    10
      concat2.dplyr() 1.057991 1.065566 1.109099 1.145657 1.079201    10
  concat.data.table() 1.024101 1.018443 1.093604 1.085254 1.066560    10
 concat2.data.table() 1.000000 1.000000 1.000000 1.000000 1.000000    10
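Findings: on this sample data, concat2 with data.table was the fastest combination, and by minimum and lower-quartile timings data.table was slightly faster than dplyr. The differences are small, though: the other three combinations were within roughly 10% of the fastest at the median, and concat2 was not faster than concat under dplyr.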
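Since the benchmark treats concat and concat2 as interchangeable, it may be worth checking that they return the same result on a few edge cases. This is a minimal sketch using base R only; the test vectors are made up for illustration.

```r
concat <- function(x) {
  x <- na.omit(x)
  if (length(x) == 0) as.character(NA) else paste(x, collapse = ",")
}
concat2 <- function(x) if (all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")

# Illustrative edge cases: mixed NAs, all NA, empty input, single value
cases <- list(
  c("a", NA, "b"),
  c(NA_character_, NA_character_),
  character(0),
  "x"
)

# TRUE for each case where both functions give identical output
agree <- vapply(cases, function(x) identical(concat(x), concat2(x)), logical(1))
print(agree)
```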