How to compute CJ by groups faster? (data.table's Cross Join): Answers

【Question title】: How to compute CJ by groups faster? (data.table's Cross Join)
【Posted】: 2019-10-23 22:14:41
【Question description】:

I need to compute a cross join by group (many times) on a large dataset, and it is slow. Can you show me a faster way?

Toy example:

library(data.table)

set.seed(1)
totletter <- 10
LLL <- LETTERS[1:totletter]
nID <- 100000
neach <- 5
nnn <- rep(1:nID, each=neach)  # In my real problem each is not constant
myDT <- data.table(id=paste0("ID",nnn), group=sample(LLL,nID*neach,replace=TRUE))

Now I want to make the following operation faster. It combines each id's unique letters with all possible letters.

combi <- myDT[,CJ( unique(group) ,LLL), by=id]

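On my computer, nID = 100000 groups takes 92 seconds.
nID = 1M takes roughly 920 seconds. (I need 1M.)

I know this is related to similar issues. Running any function over many small groups is slow: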

https://github.com/Rdatatable/data.table/issues/3988 https://github.com/Rdatatable/data.table/issues/3739

I just need a trick to do this faster for CJ.

【Comments】:

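CJ is parallelized internally. If each group has only a few rows, you are probably better off calling setDTthreads(1L) to remove the parallelization overhead.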
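The suggestion in the comment above could be tried roughly as follows. This is only a minimal sketch on a tiny stand-in table (small_dt and its values are made up for illustration, not taken from the question), and whether it helps on the real data would have to be measured.

```r
library(data.table)

# Remember the current thread setting so it can be restored afterwards.
old_threads <- getDTthreads()
setDTthreads(1L)

# Hypothetical miniature of the question's data.
LLL <- LETTERS[1:3]
small_dt <- data.table(id = c("ID1", "ID1", "ID2"), group = c("A", "B", "C"))

# Same per-id cross join as in the question, now run single-threaded.
combi_small <- small_dt[, CJ(unique(group), LLL), by = id]

# Put the thread setting back to what it was.
setDTthreads(old_threads)
```

Restoring the previous thread count matters if other data.table operations in the same session do benefit from multiple threads.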

Tags: r performance group-by data.table

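【Solution 1】:

I think a fair question is what you intend to do with so many combinations. In any case, here are 2 options:

1) Get the unique groups by id, then perform a cross join (see Reference)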

ug <- myDT[, unique(group), id]
ug[, c(.SD, .(LLL=LLL)), seq_len(ug[, .N])][, (1) := NULL]

2) Get the unique groups, then CJ the row indices and extract the rows corresponding to those indices

ug <- myDT[, unique(group), id]
idx <- CJ(ug[,seq_len(.N)], seq_along(LLL))
ug[idx$V1, c(.SD, .(LLL=LLL[idx$V2]))]
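To make option 2 easier to follow, here is the same code on a tiny table where the result can be checked by hand. The table toy and its values are hypothetical; only the three middle lines come from the answer.

```r
library(data.table)

LLL <- LETTERS[1:3]
# Hypothetical toy data: ID1 has groups A and B (A appears twice), ID2 has group C.
toy <- data.table(id    = c("ID1", "ID1", "ID1", "ID2"),
                  group = c("A",   "B",   "A",   "C"))

# One row per (id, unique group); the unnamed column is called V1.
ug <- toy[, unique(group), id]

# A single CJ call over (row number of ug) x (position in LLL).
idx <- CJ(ug[, seq_len(.N)], seq_along(LLL))

# Pull the indexed rows of ug and attach the matching letter.
res <- ug[idx$V1, c(.SD, .(LLL = LLL[idx$V2]))]

# Reference: the per-id CJ from the question, for comparison.
ref <- toy[, CJ(unique(group), LLL), by = id]
```

Here ug has 3 rows and LLL has 3 letters, so both res and ref have 9 rows with columns id, V1 and LLL. The difference is that res is built from one CJ call in total, while ref calls CJ once per id.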

Timing code:

library(data.table)

set.seed(1L)
totletter <- 10
LLL <- LETTERS[1:totletter]
nID <- 1e5
neach <- 5
nnn <- rep(1:nID, each=neach)  # In my real problem each is not constant
myDT <- data.table(id=paste0("ID",nnn), group=sample(LLL,nID*neach,replace=TRUE))

mtd0 <- function() myDT[,CJ( unique(group) ,LLL), by=id]

mtd1 <- function() {
    ug <- myDT[, unique(group), id]
    ug[, c(.SD, .(LLL=LLL)), seq_len(ug[, .N])][, (1) := NULL]
}

mtd2 <- function() {
    ug <- myDT[, unique(group), id]
    idx <- CJ(ug[,seq_len(.N)], seq_along(LLL))
    ug[idx$V1, c(.SD, .(LLL=LLL[idx$V2]))]
}

combi <- mtd0()
setorder(combi, id, V1, LLL)
ans1 <- mtd1()
setorder(ans1, id, V1, LLL)
ans2 <- mtd2()
setorder(ans2, id, V1, LLL)
identical(combi, ans1)
# [1] TRUE
identical(ans1, ans2)
# [1] TRUE

bench::mark(mtd0(), mtd1(), mtd2(), check=FALSE)

Timings:

# A tibble: 3 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                   memory                 time     gc              
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                   <list>                 <list>   <list>          
1 mtd0()        1.14m    1.14m    0.0146    1.84GB    0.583     1    40      1.14m <df[,3] [4,094,950 x 3]> <df[,3] [522,766 x 3]> <bch:tm> <tibble [1 x 3]>
2 mtd1()        1.67s    1.67s    0.600   265.05MB    1.80      1     3      1.67s <df[,3] [4,094,950 x 3]> <df[,3] [1,753 x 3]>   <bch:tm> <tibble [1 x 3]>
3 mtd2()     926.29ms 926.29ms    1.08    257.22MB    1.08      1     1   926.29ms <df[,3] [4,094,950 x 3]> <df[,3] [23,859 x 3]>  <bch:tm> <tibble [1 x 3]>

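Reference:

Cross join of 2 data.tables: https://github.com/Rdatatable/data.table/issues/1717#issuecomment-515002560

Edit to address OP's comment:

Actually, apart from the memory usage of OP's method, I think it is the use of by that slows things down, as can be seen from the empirical timings below: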

set.seed(1L)
totletter <- 10
LLL <- LETTERS[1:totletter]
nID <- 1e5
neach <- 5
nnn <- rep(1:nID, each=neach)  # In my real problem each is not constant
myDT <- data.table(id=paste0("ID",nnn), group=sample(LLL,nID*neach,replace=T))

mtd00 <- function() myDT[,CJ(unique(group), LLL), by=id]
mtd01 <- function() myDT[,CJ(group, LLL, unique=TRUE), by=id]
mtd02 <- function() myDT[, .(group=unique(group)), id][, CJ(group ,LLL), by=id]

mtd1 <- function() {
    ug <- myDT[, unique(group), id]
    ug[, c(.SD, .(LLL=LLL)), seq_len(ug[, .N])][, (1) := NULL]
}

mtd2 <- function() {
    ug <- myDT[, unique(group), id]
    idx <- CJ(ug[,seq_len(.N)], seq_along(LLL))
    ug[idx$V1, c(.SD, .(LLL=LLL[idx$V2]))]
}

Timings:

# A tibble: 5 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                   memory                 time     gc              
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                   <list>                 <list>   <list>          
1 mtd00()       1.16m    1.16m   0.0143     1.84GB    0.588     1    41      1.16m <df[,3] [4,094,950 x 3]> <df[,3] [515,150 x 3]> <bch:tm> <tibble [1 x 3]>
2 mtd01()       1.72m    1.72m   0.00969    1.85GB    0.427     1    44      1.72m <df[,3] [4,094,950 x 3]> <df[,3] [599,409 x 3]> <bch:tm> <tibble [1 x 3]>
3 mtd02()       1.05m    1.05m   0.0159     1.85GB    0.620     1    39      1.05m <df[,3] [4,094,950 x 3]> <df[,3] [528,108 x 3]> <bch:tm> <tibble [1 x 3]>
4 mtd1()        1.45s    1.45s   0.691    265.11MB    1.38      1     2      1.45s <df[,3] [4,094,950 x 3]> <df[,3] [4,130 x 3]>   <bch:tm> <tibble [1 x 3]>
5 mtd2()        1.11s    1.11s   0.900    257.38MB    1.80      1     2      1.11s <df[,3] [4,094,950 x 3]> <df[,3] [467 x 3]>     <bch:tm> <tibble [1 x 3]>
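As a further illustration that the cross join itself is cheap once it is not run per id, the same set of rows can be built with no by and no CJ at all, using plain rep. This variant is not part of the answer above; it is a sketch on hypothetical toy data, and it names the second column group rather than V1.

```r
library(data.table)

LLL <- LETTERS[1:3]
# Hypothetical toy data, not the benchmark data.
toy <- data.table(id    = c("ID1", "ID1", "ID1", "ID2"),
                  group = c("A",   "B",   "A",   "C"))

# One row per distinct (id, group) pair.
ug <- unique(toy, by = c("id", "group"))

# Repeat every row of ug once per letter ...
res_rep <- ug[rep(seq_len(nrow(ug)), each = length(LLL))]

# ... and cycle through the letters alongside; set() adds the column by reference.
set(res_rep, j = "LLL", value = rep(LLL, times = nrow(ug)))

# Reference: the per-id CJ from the question.
ref <- toy[, CJ(unique(group), LLL), by = id]
```

On this toy input res_rep contains the same 9 (id, group, letter) combinations as ref. How it compares in speed to mtd1 and mtd2 on the full data has not been measured here.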

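【Discussion】:

Why are these methods so much faster? They should be doing much the same thing as the original CJ, shouldn't they? Though they do work from a reduced dataset.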