【Question title】:What is the FASTEST way in R to group by a data.table column and count unique values in another column?
【Posted】:2018-07-26 17:49:22
【Question】:

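Background: this runs inside a swap-optimization algorithm. This particular line sits in an inner while loop, so it is executed a very large number of times. Everything else in the loop runs very fast.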

The example data.table "Inventory_test" is created as follows:

library(data.table)

NestCount2 <- c(
  "1","1","1","1","1","1","1","1","2","2","3","3","3","3","3","3",
  "3","3","3","4","4","4","5","5","5","5","5","5","5","5","5","6",
  "6","6","6","6","6","6","6","6","",""
)
Part2 <- c(
  "Shroud","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
  "Shroud","S1Nozzle","S1Nozzle","Shroud","Shroud","Shroud","Shroud",
  "Shroud","Shroud","Shroud","Shroud","Shroud","S2Nozzle","S2Nozzle",
  "S2Nozzle","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
  "Shroud","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
  "Shroud","Shroud","Shroud","Shroud","Shroud","*","*"
)
Inventory_test <- data.table(data.frame(NestCount2, Part2))

# Methods already tried (essentially identical performance in the profiler):
ptcts <- table(unique(Inventory_test[,c("Part2","NestCount2")])$Part2)
ptcts2 <- Inventory_test[, .(count = uniqueN(NestCount2)), by = Part2]$count

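I noticed (using the RStudio profiler) that about half of the time on the ptcts line is spent just indexing the columns with Inventory_test[, c("Part2", "NestCount2")]. I have been looking for a faster way but have not found one :(. Any help would be greatly appreciated!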

【Comments】:

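  • This probably does not matter for performance, but for sanity you may want to use data.table(NestCount2, Part2) instead of data.table(data.frame(NestCount2, Part2)). For speed, maybe ... Inventory_test[, .N, by=.(Part2, NestCount2)][, .N, by=Part2]?
  • There is also setkey for data.tables.
  • Thanks! I will look into setkey. Just to clarify: the only line that is in the actual code is "ptcts"; everything above it is just there to give people here a sample dt to play with.
  • Maybe define ptcts2 as Inventory_test[, uniqueN(NestCount2), by = Part2]$V1; skipping the = .() seems to speed things up slightly.
  • You might get a marginal speed-up for ptcts by skipping the []: Inventory_test[, uniqueN(paste(Part2, NestCount2)), by=Part2]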
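
For readers following the comments, here is a minimal sketch of how the two-step .N chain relates to the single uniqueN() call, and what setkey does. The table `dt` and its values are hypothetical toy data, not the question's Inventory_test.

```r
library(data.table)

# Hypothetical toy data, smaller than the question's Inventory_test
dt <- data.table(
  Part2      = c("Shroud", "Shroud", "Shroud", "S1Nozzle", "S1Nozzle", "*"),
  NestCount2 = c("1", "1", "3", "2", "2", "")
)

# Step 1: collapse to one row per distinct (Part2, NestCount2) pair
pairs <- dt[, .N, by = .(Part2, NestCount2)]

# Step 2: the number of rows per Part2 is the number of distinct NestCount2 values
two_step <- pairs[, .N, by = Part2]

# Single-step equivalent using uniqueN()
one_step <- dt[, uniqueN(NestCount2), by = Part2]

# setkey() sorts dt by Part2 in place and records Part2 as the key
setkey(dt, Part2)
```

Both approaches should give the same counts per Part2. Whether setkey helps at all on a table this small is something only a benchmark on the real data can tell.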

Tags: r data.table


【Solution 1】:

I ran some benchmarks: so far, it looks like the fastest method is to not use by at all, but rather just table(), as in Inventory_test[, rowSums(table(Part2, NestCount2) > 0L)]
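
To see why the table() expression counts unique values, here is a small sketch on hypothetical toy data (the names `dt`, `tab`, `counts_table` and `counts_by` are made up for illustration):

```r
library(data.table)

# Hypothetical toy data, not the question's Inventory_test
dt <- data.table(
  Part2      = c("A", "A", "A", "B", "B", "C"),
  NestCount2 = c("1", "1", "2", "3", "3", "1")
)

# Contingency table: one row per Part2 value, one column per NestCount2 value
tab <- dt[, table(Part2, NestCount2)]

# A cell > 0 means that (Part2, NestCount2) pair occurs at least once,
# so summing the TRUE cells in each row counts distinct NestCount2 per Part2
counts_table <- rowSums(tab > 0L)

# Reference result using uniqueN() with by
counts_by <- dt[, uniqueN(NestCount2), by = Part2]
```

Two things to keep in mind when swapping this in: table() orders its rows by sorted Part2 value, while by keeps groups in order of first appearance, and table() drops NA values by default. On data where either matters, the two results would need to be aligned before being compared.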

library(data.table)
library(microbenchmark)
library(ggplot2)

setkey(Inventory_test, Part2)

microbenchmark(
  unit = "relative",
  m1 = table(unique(Inventory_test[, c("Part2", "NestCount2")])$Part2),
  m2 = Inventory_test[, .(count = uniqueN(NestCount2)), by = Part2]$count,
  m3 = Inventory_test[, .N, by = .(Part2, NestCount2)][, .N, by = Part2],
  m4 = Inventory_test[, uniqueN(NestCount2), by = Part2]$V1,
  m5 = Inventory_test[, uniqueN(paste(Part2, NestCount2)), by = Part2],
  m6 = Inventory_test[, length(unique(NestCount2)), Part2],
  m7 = Inventory_test[, rowSums(table(Part2, NestCount2) > 0L)]
) -> mb

print(mb, digits = 3)
#> Unit: relative
#>  expr  min   lq mean median   uq  max neval cld
#>    m1 1.26 1.27 1.37   1.32 1.60 1.12   100  b 
#>    m2 1.28 1.18 1.29   1.16 1.20 5.93   100  b 
#>    m3 2.21 2.05 2.14   1.98 2.10 3.92   100   c
#>    m4 1.25 1.16 1.23   1.14 1.16 3.97   100 ab 
#>    m5 1.34 1.23 1.28   1.22 1.18 4.27   100 ab 
#>    m6 1.48 1.37 1.35   1.33 1.35 1.18   100  b 
#>    m7 1.00 1.00 1.00   1.00 1.00 1.00   100 a

autoplot(mb)

Created on 2018-07-27 by the reprex package (v0.2.0.9000).

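PS. Interestingly, data.table(data.frame(NestCount2, Part2)) is actually a bit faster than data.table(NestCount2, Part2). That is because data.frame() coerces the strings to factors, and these operations seem to be a bit faster on factors.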

For once, stringsAsFactors = TRUE does something good. Go figure!
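
A note for anyone reproducing the PS today: since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the factor conversion described above no longer happens implicitly. The sketch below, on a shortened hypothetical version of the question's vectors, requests it explicitly so the column types can be compared.

```r
library(data.table)

# Shortened, hypothetical versions of the question's vectors
NestCount2 <- c("1", "1", "2", "")
Part2 <- c("Shroud", "Shroud", "S1Nozzle", "*")

# Asking for factors explicitly gives factor columns regardless of R version
dt_factor <- data.table(data.frame(NestCount2, Part2, stringsAsFactors = TRUE))

# data.table() itself keeps character columns as character by default
dt_char <- data.table(NestCount2, Part2)

col_classes_factor <- sapply(dt_factor, class)
col_classes_char <- sapply(dt_char, class)
```

Whether the factor version is still faster would have to be re-measured, for example by rerunning the benchmark above against both dt_factor-style and dt_char-style tables.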

【Comments】:

  • Thanks!! That is definitely an improvement, and at this point anything helps a lot.