【Posted】: 2021-07-23 19:54:31
【Question】:
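I am fully aware that this type of question has been asked hundreds of times.
Even so, I could not find an answer that addresses my specific concerns, namely:
- Performance (I know how to do what I need, but in some cases it is too slow, so I am looking for a faster solution)
- Good programming practice (I wonder whether the approach I chose is "clean", rather than roundabout or inefficient for some other reason)
I have a data.frame with numeric and character columns. I need to create a summary data.frame from it, grouped by one of the character columns (ID), that reports 1) some statistics of some of the numeric columns within each group, and 2) some character concatenations. In other words, the report has mixed data types, which is what makes it tricky, at least for me, and is why I am asking for advice.
Here is the R script: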
# Simulate original data.frame
set.seed(384092)
N <- 10000
d <- data.frame("ID" = paste0(sample(LETTERS, N, replace = TRUE), sprintf("%03.0f", sample(1:floor(sqrt(N)), N, replace = TRUE))), stringsAsFactors = FALSE)
d["set"] <- sample(LETTERS, N, replace = TRUE)
d["P"] <- runif(N, -20, 120)
d["K"] <- rnorm(N, 10, 0.5)
# Make summary
# For each unique ID, report: ID, number of rows of d, mean of P, sd of P, comma-separated list of unique set's
# Method 1: rbind data.frames from 'by'
time.1 <- system.time({
d_summary.1 <- do.call(rbind, by(d, d$ID, function(dd) {
data.frame("ID" = dd$ID[1], "N" = nrow(dd), "P_mean" = mean(dd$P), "P_sd" = sd(dd$P), "sets" = paste(unique(dd$set), collapse = ","))
})
)
})
cat("\ntime.1 =",time.1,"\n")
print(sapply(d_summary.1, class))
# Method 2: create a list of lists and combine them at the end
# https://stackoverflow.com/a/68162050/6376297
time.2 <- system.time({
time.2.1 <- system.time({d_summary.2 <- by(d, d$ID, function(dd) {
list("ID" = dd$ID[1], "N" = nrow(dd), "P_mean" = mean(dd$P), "P_sd" = sd(dd$P), "sets" = paste(unique(dd$set), collapse = ","))
})
})
d_summary.2 <- do.call(rbind, lapply(d_summary.2, data.frame))
})
cat("\ntime.2.1 =",time.2.1)
cat("\ntime.2 =",time.2,"\n")
print(sapply(d_summary.2, class))
On my computer, this produces the following output:
time.1 = 1.72 0 1.72 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.2.1 = 0.3 0 0.29 NA NA
time.2 = 1.79 0 1.82 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
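The linked post, https://stackoverflow.com/a/68162050/6376297, specifically mentions that the kind of processing used in Method 2 is necessary to avoid coercing all columns to a single data type.
Indeed, every solution I tried that relies on building an intermediate matrix results, entirely as expected, in coercion to character.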
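To make the coercion concrete, here is a minimal sketch on toy records. The values and the names `res`, `via_matrix` and `via_df` are made up for illustration and are not taken from `d`. It contrasts a route through an atomic matrix with binding one-row data.frames, as Method 2 does.

```r
# Toy per-group results with mixed types (hypothetical values, not from d)
res <- list(
  list(ID = "A001", N = 3L, P_mean = 12.5, sets = "A,B"),
  list(ID = "B002", N = 5L, P_mean = 40.1, sets = "C")
)

# Route 1: through an atomic matrix. unlist() flattens each record to a
# character vector, so rbind() yields a character matrix and every
# resulting column is character.
m <- do.call(rbind, lapply(res, unlist))
via_matrix <- as.data.frame(m, stringsAsFactors = FALSE)
print(sapply(via_matrix, class))

# Route 2: bind one-row data.frames. Each column keeps its own type.
via_df <- do.call(rbind, lapply(res, data.frame, stringsAsFactors = FALSE))
print(sapply(via_df, class))
```

The first route loses the integer and numeric types because a matrix can hold only one atomic type; the second keeps them but pays the cost of building and binding many small data.frames.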
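This is really unfortunate because, as time.2.1 shows, the initial creation of the list of lists, which contains the required information and still preserves all the original data types, takes only about 1/6 to 1/5 of the total time.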
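Since the list of lists is cheap and the expensive part is turning it into a data.frame, one direction that may be worth timing is to assemble the result column by column with `vapply` and call `data.frame()` only once. The sketch below runs on a small stand-in data set. The names `toy`, `recs` and `summary_cols` are hypothetical, and it has not been benchmarked against Method 2 on the real `d`.

```r
# Small stand-in for d (hypothetical data; same column names as in the script)
set.seed(1)
n <- 200
toy <- data.frame(
  ID  = sample(c("A001", "B002", "C003"), n, replace = TRUE),
  set = sample(LETTERS[1:4], n, replace = TRUE),
  P   = runif(n, -20, 120),
  stringsAsFactors = FALSE
)

# Step 1: one plain list per group, as in Method 2 (types are preserved)
recs <- lapply(split(toy, toy$ID), function(dd) {
  list(ID = dd$ID[1], N = nrow(dd), P_mean = mean(dd$P), P_sd = sd(dd$P),
       sets = paste(unique(dd$set), collapse = ","))
})

# Step 2: extract each field as one typed vector, then build a single
# data.frame instead of rbind-ing many one-row data.frames
summary_cols <- data.frame(
  ID     = vapply(recs, function(r) r$ID, character(1)),
  N      = vapply(recs, function(r) r$N, integer(1)),
  P_mean = vapply(recs, function(r) r$P_mean, numeric(1)),
  P_sd   = vapply(recs, function(r) r$P_sd, numeric(1)),
  sets   = vapply(recs, function(r) r$sets, character(1)),
  stringsAsFactors = FALSE
)
print(sapply(summary_cols, class))
```

`vapply` checks the declared type of every element, so a field that unexpectedly changes type raises an error instead of being silently coerced.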
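Bear in mind that I am doing this on a d that is at least 10-100 times larger than in this example.
Can anyone suggest a faster approach?
Thanks!
EDIT: following up on user feedback
I tried the dplyr (4) and data.table (5) approaches, plus two more base R approaches using aggregate, (6) and (7), which are more involved but may be somewhat competitive with those two.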
# Method 4: dplyr
require(dplyr)
time.4 <- system.time({
d %>%
group_by(ID) %>%
summarise(N = n(),
P_mean = mean(P),
P_sd = sd(P),
sets = paste(unique(set), collapse = ",")) -> d_summary.4
})
cat("\ntime.4 =",time.4,"\n")
print(sapply(d_summary.4, class))
# Method 5: data.table
require(data.table)
time.5 <- system.time({
setDT(d)  # converts d to a data.table by reference; d stays a data.table for the methods below
d_summary.5 <- d[, .(N = .N,
P_mean = mean(P),
P_sd = sd(P),
sets = toString(unique(set))), ID]
d_summary.5 <- as.data.frame(d_summary.5)
})
cat("\ntime.5 =",time.5,"\n")
print(sapply(d_summary.5, class))
# Method 6: aggregate each column separately and merge
time.6 <- system.time({
d_summary.6 <- setNames(as.data.frame(table(d$ID), stringsAsFactors = FALSE), c("ID", "N"))
d_summary.6 <- merge(d_summary.6, setNames(aggregate(P ~ ID, data = d, FUN = mean),c("ID","P_mean")), by = "ID")
d_summary.6 <- merge(d_summary.6, setNames(aggregate(P ~ ID, data = d, FUN = sd),c("ID","P_sd")), by = "ID")
d_summary.6 <- merge(d_summary.6, setNames(aggregate(set ~ ID, data = d, FUN = function(x) {paste(unique(x),collapse=",")}),c("ID","sets")), by = "ID")
})
cat("\ntime.6 =",time.6,"\n")
print(sapply(d_summary.6, class))
# Method 7: aggregate each column separately and cbind (this assumes that both table and aggregate will report all values of ID, sorted)
time.7 <- system.time({
d_summary.7 <- setNames(as.data.frame(table(d$ID), stringsAsFactors = FALSE), c("ID", "N"))
d_summary.7 <- cbind(d_summary.7, "P_mean" = aggregate(P ~ ID, data = d, FUN = mean)[,2])
d_summary.7 <- cbind(d_summary.7, "P_sd" = aggregate(P ~ ID, data = d, FUN = sd)[,2])
d_summary.7 <- cbind(d_summary.7, "sets" = aggregate(set ~ ID, data = d, FUN = function(x) {paste(unique(x),collapse=",")})[,2])
})
cat("\ntime.7 =",time.7,"\n")
print(sapply(d_summary.7, class))
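Method 7 is only correct if `table` and `aggregate` return the same IDs in the same order. A minimal sketch of how that assumption could be checked before calling `cbind` follows, on toy data with hypothetical names (`toy`, `counts`, `means`, `same_order`). One case where the check could fail: the formula interface of `aggregate` drops rows with missing values by default, so a group whose `P` values are all `NA` would appear in `table` but not in `aggregate`.

```r
# Toy data (hypothetical; same column names as in the script)
set.seed(2)
toy <- data.frame(
  ID = sample(c("B002", "A001", "C003"), 60, replace = TRUE),
  P  = runif(60, -20, 120),
  stringsAsFactors = FALSE
)

counts <- setNames(as.data.frame(table(toy$ID), stringsAsFactors = FALSE),
                   c("ID", "N"))
means <- aggregate(P ~ ID, data = toy, FUN = mean)

# TRUE only if both results list the same IDs in the same order;
# if this were FALSE, cbind() would silently misalign the rows
same_order <- identical(counts$ID, as.character(means$ID))
print(same_order)

# Guarded version: stop early instead of binding misaligned columns
stopifnot(same_order)
combined <- cbind(counts, P_mean = means$P)
```

If the check fails on real data, `merge` by ID (as in Method 6) avoids relying on row order.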
Timings:
time.1 = 1.73 0.02 1.77 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.2.1 = 0.29 0 0.3 NA NA
time.2 = 1.83 0.01 1.84 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.4 = 0.13 0 0.13 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.5 = 0.08 0 0.08 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.6 = 0.25 0 0.25 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.7 = 0.25 0 0.25 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
【Discussion】: