Answers: the fastest method of extracting distinct values in R

【Question title】: The fastest method of extracting distinct values in R
【Posted】: 2021-07-08 15:11:22
【Question description】:

I want to reproduce the example from this post, which demonstrates the fastest way to extract sorted unique values: What is the fastest way to get a vector of sorted unique values from a data.table?

test_df <-
  data.frame(
    company = c(1, 1,  2, 2, 3)
  )

unique_values = df[,logical(1), keyby = company]$company

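But I keep getting the error:

Error in `[.data.frame`(df, , logical(1), keyby = company) : unused argument (keyby = company)

Edit. Please note that the point of my question is to get this particular method working. For suggestions of other ways to reach the same goal, please see the post I referenced.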

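【Comments】:

Create df <- data.table::as.data.table(test_df)
If you don't need them sorted: unique(test_df$company). Sorting in base is still not that slow either: sort(unique(test_df$company))
@GKi unique(test_df$company) is noticeably slower on a large df. That is why I want to get this example working.
This may be a matter of multiple cores/threads. If you use only one core, or sum up the time of each thread, there should not be much difference.
Your example does not work because you created a data.frame but want to use a data.table method. So add the line from my first comment to convert it, or create a data.table directly.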
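
The fix described in the comments can be sketched as follows. This is a minimal example that assumes the data.table package is installed; the variable name `dt` is chosen here for illustration and does not appear in the question.

```r
library(data.table)

test_df <- data.frame(company = c(1, 1, 2, 2, 3))

# Convert the data.frame to a data.table so that `[` accepts `keyby`
dt <- as.data.table(test_df)

# Group by company (keyby also sorts by the grouping column),
# compute a dummy value per group, then pull out the key column
unique_values <- dt[, logical(1), keyby = company]$company
print(unique_values)
```

The original call failed because `[` on a plain data.frame has no `keyby` argument; once the object is a data.table, the same expression returns the sorted distinct values.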

Tags: r distinct distinct-values

【Solution 1】:

If you are looking for a fast unique, have a look at kit::funique or collapse::funique:

setDTthreads(1)
microbenchmark::microbenchmark(
dt = y[,logical(1), keyby = company]$company,
base = unique(x$company),
collapse = collapse::funique(x$company),
kit = kit::funique(x$company))
#Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#       dt 12.862388 13.575131 14.759180 14.248541 14.945780 49.930937   100
#     base 12.939646 13.505176 14.734066 14.773846 15.415468 18.256204   100
# collapse  3.302862  3.589133  3.685685  3.692886  3.773045  4.063564   100
#      kit  1.903043  2.433478  2.963308  2.882986  3.076537  6.183840   100

setDTthreads(4)
microbenchmark::microbenchmark(
dt = y[,logical(1), keyby = company]$company,
base = unique(x$company),
collapse = collapse::funique(x$company),
kit = kit::funique(x$company))
#Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#       dt  5.480513  7.384032  7.873730  7.569420  8.346282 11.193741   100
#     base 12.998406 13.295775 14.464446 13.736353 14.856721 47.320488   100
# collapse  3.333292  3.549712  3.655851  3.645528  3.737236  4.325676   100
#      kit  1.881232  2.825040  2.959422  2.917149  3.004288  5.281440   100

Data and libraries:

set.seed(42)
n <- 1e6
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)

x <- data.frame(company = sample(company, n, TRUE),
                item = sample(item, n, TRUE),
                sales = sample(sales, n, TRUE))

library(data.table)
y <- as.data.table(x)
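
One thing to keep in mind when comparing these timings: the `keyby` approach returns the values sorted, whereas base `unique()` returns them in order of first appearance. A small sketch of that difference, using a smaller sample than the benchmark and assuming only that data.table is installed (the names `dt_vals` and `base_vals` are illustrative):

```r
library(data.table)

set.seed(42)
company <- c("A", "S", "W", "L", "T")
x <- data.frame(company = sample(company, 1e4, TRUE),
                stringsAsFactors = FALSE)
y <- as.data.table(x)

dt_vals   <- y[, logical(1), keyby = company]$company  # sorted by the key
base_vals <- unique(x$company)                         # order of first appearance

# Both contain the same set of values; the data.table result
# matches the base result once the latter is sorted
stopifnot(setequal(dt_vals, base_vals))
stopifnot(identical(dt_vals, sort(base_vals)))
```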

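【Comments】:

Thanks, but no. I want to get the unique values the way I posted in the question. The recommendation of the kit package probably belongs under the question I referenced. I see you have already posted this alternative there.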