好的,所以我没有很好的技术答案,但我想我已经在 options(datatable.verbose=TRUE) 的帮助下从概念上解决了这个问题
创建数据
library(data.table)
n <- 1e8
y_unkeyed_5groups <- data.table(sample(1:10000,n,replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_5groups[,sumV2:=sum(V2),keyby=V1]
y_unkeyed_10000groups <- data.table(sample(1:10000,n,replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_10000groups[,sumV2:=sum(V2),keyby=V1]
慢跑
x <- copy(y)
system.time({
setattr(x, "sorted", c("V1","sumV2"))
x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)]
})
# Detected that j uses these columns: V3,V2
# Finding groups using uniqlist on key ... 1.050s elapsed (1.050s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(sum(V3 * V2)/.BY$sumV2)'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.305s for 6 groups
# eval(j) took 0.254s for 6 calls
# 0.560s elapsed (0.510s cpu)
# user system elapsed
# 1.81 0.09 1.72
跑得快:
x <- copy(y)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),keyby=V1])
# Detected that j uses these columns: V3,V2,sumV2
# Finding groups using uniqlist on key ... 0.060s elapsed (0.070s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(sum(V3 * V2)/sumV2[1], sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.328s for 6 groups
# eval(j) took 0.291s for 6 calls
# 0.610s elapsed (0.580s cpu)
# user system elapsed
# 1.08 0.08 0.82
finding groups 部分是造成差异的原因。我猜这里发生的事情是设置 key 实际上只是排序(我应该从属性的命名方式中猜到!)并且实际上并没有做任何事情来定义组的开始和结束位置。因此,即使data.table 知道sumV2 已排序,它也不知道它是相同的值,因此必须找到sumV2 中的组的开始和结束位置。
我的猜测是,在技术上可以编写data.table,其中键控不仅排序,而且实际上将每个组的开始/结束行存储在键控变量中,但这可能会占用很多时间具有大量组的 data.tables 的内存。
知道了这一点,似乎建议不要一遍又一遍地重复相同的 by 语句,而是在单个 by 语句中完成您需要做的所有事情。总的来说,这可能是一个很好的建议,但对于少数群体而言,情况并非如此。参见下面的反例:
我以我认为使用 data.table 最快的方式重写了这个(只有一个 by 语句,并使用了 GForce):
library(data.table)
n <- 1e8
y_unkeyed_5groups <- data.table(sample(1:5,n, replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_10000groups <- data.table(sample(1:10000,n, replace=TRUE),rnorm(n),rnorm(n))
x <- copy(y_unkeyed_5groups)
system.time({
x[, product:=V3*V2]
outDT <- x[,list(sumV2=sum(V2),sumProduct=sum(product)),keyby=V1]
outDT[,`:=`(out=sumProduct/sumV2,sumProduct=NULL) ]
setkey(x,V1)
x[outDT,sumV2:=sumV2,all=TRUE]
x[,product:=NULL]
outDT
})
# Detected that j uses these columns: V3,V2
# Assigning to all 100000000 rows
# Direct plonk of unnamed RHS, no copy.
# Detected that j uses these columns: V2,product
# Finding groups using forderv ... 0.350s elapsed (0.810s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(sum(V2), sum(product))'
# GForce optimized j to 'list(gsum(V2), gsum(product))'
# Making each group and running j (GForce TRUE) ... 1.610s elapsed (4.550s cpu)
# Detected that j uses these columns: sumProduct,sumV2
# Assigning to all 5 rows
# RHS for item 1 has been duplicated because NAMED is 3, but then is being plonked. length(values)==2; length(cols)==2)
# forder took 0.98 sec
# reorder took 3.35 sec
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: sumV2
# Assigning to 100000000 row subset of 100000000 rows
# Detected that j uses these columns: product
# Assigning to all 100000000 rows
# user system elapsed
# 11.00 1.75 5.33
x2 <- copy(y_unkeyed_5groups)
system.time({
x2[,sumV2:=sum(V2),keyby=V1]
outDT2 <- x2[, list(sumV2=sumV2[1],out=sum(V3*V2)/sumV2[1]),keyby=V1]
})
# Detected that j uses these columns: V2
# Finding groups using forderv ... 0.310s elapsed (0.700s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'sum(V2)'
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# collecting discontiguous groups took 0.714s for 5 groups
# eval(j) took 0.079s for 5 calls
# 1.210s elapsed (1.160s cpu)
# setkey() after the := with keyby= ... forder took 1.03 sec
# reorder took 3.21 sec
# 1.600s elapsed (3.700s cpu)
# Detected that j uses these columns: sumV2,V3,V2
# Finding groups using uniqlist on key ... 0.070s elapsed (0.070s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(sumV2[1], sum(V3 * V2)/sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.347s for 5 groups
# eval(j) took 0.265s for 5 calls
# 0.630s elapsed (0.620s cpu)
# user system elapsed
# 6.57 0.98 3.99
all.equal(x,x2)
# TRUE
all.equal(outDT,outDT2)
# TRUE
好吧,事实证明,当只有 5 个组时,不重复语句和使用 GForce 所获得的效率并不重要。但是对于更多的组来说,这确实会有所不同,(尽管我没有写出这样一种方式来区分仅使用一个 by 语句而不是 GForce 与使用 GForce 和多个 by 语句的好处):
x <- copy(y_unkeyed_10000groups)
system.time({
x[, product:=V3*V2]
outDT <- x[,list(sumV2=sum(V2),sumProduct=sum(product)),keyby=V1]
outDT[,`:=`(out=sumProduct/sumV2,sumProduct=NULL) ]
setkey(x,V1)
x[outDT,sumV2:=sumV2,all=TRUE]
x[,product:=NULL]
outDT
})
#
# Detected that j uses these columns: V3,V2
# Assigning to all 100000000 rows
# Direct plonk of unnamed RHS, no copy.
# Detected that j uses these columns: V2,product
# Finding groups using forderv ... 0.740s elapsed (1.220s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(sum(V2), sum(product))'
# GForce optimized j to 'list(gsum(V2), gsum(product))'
# Making each group and running j (GForce TRUE) ... 0.810s elapsed (2.390s cpu)
# Detected that j uses these columns: sumProduct,sumV2
# Assigning to all 10000 rows
# RHS for item 1 has been duplicated because NAMED is 3, but then is being plonked. length(values)==2; length(cols)==2)
# forder took 1.97 sec
# reorder took 11.95 sec
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: sumV2
# Assigning to 100000000 row subset of 100000000 rows
# Detected that j uses these columns: product
# Assigning to all 100000000 rows
# user system elapsed
# 18.37 2.30 7.31
x2 <- copy(y_unkeyed_10000groups)
system.time({
x2[,sumV2:=sum(V2),keyby=V1]
outDT2 <- x[, list(sumV2=sumV2[1],out=sum(V3*V2)/sumV2[1]),keyby=V1]
})
# Detected that j uses these columns: V2
# Finding groups using forderv ... 0.770s elapsed (1.490s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'sum(V2)'
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# collecting discontiguous groups took 1.792s for 10000 groups
# eval(j) took 0.111s for 10000 calls
# 3.960s elapsed (3.890s cpu)
# setkey() after the := with keyby= ... forder took 1.62 sec
# reorder took 13.69 sec
# 4.660s elapsed (14.4s cpu)
# Detected that j uses these columns: sumV2,V3,V2
# Finding groups using uniqlist on key ... 0.070s elapsed (0.070s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(sumV2[1], sum(V3 * V2)/sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.395s for 10000 groups
# eval(j) took 0.284s for 10000 calls
# 0.690s elapsed (0.650s cpu)
# user system elapsed
# 20.49 1.67 10.19
all.equal(x,x2)
# TRUE
all.equal(outDT,outDT2)
# TRUE
更一般地说,data.table 的速度非常快,但为了提取最快和最有效的计算以充分利用底层 C 代码,您需要特别注意 data.table 的内部工作原理。我最近了解了 data.table 中的 GForce 优化,似乎特定形式的 j 语句(涉及简单函数,如 mean 和 sum)在有 by 语句时直接在 C 中解析和执行。