Check whether two indicators are the same: answers

【Question title】: Check whether two indicators are the same
【Posted】: 2019-11-25 09:40:19
【Question】:

I am given a large data table containing two indicators, ind1 and ind2, whose values may repeat across rows. For example:

 library(data.table)

 set.seed(1)
 ind1 <- sample(1:3, 1000, replace = TRUE)
 ind2 <- c("a", "b", "c")[ind1]   # ind2 is a relabelling of ind1

 dt <- data.table(ind1 = ind1, ind2 = ind2)

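I now want to check whether the two indicators group the data in the same way, i.e. two rows have the same ind1 if and only if they also have the same ind2. In the example above this holds by construction.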
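To make the condition concrete, here is a minimal base R sketch on toy data. The vectors and the helper name `same_grouping_tab` are hypothetical and not part of the question. It relies on the fact that two indicators group the rows identically exactly when every row and every column of their cross-table has a single non-zero cell. The cross-table has one cell per pair of levels, so this is meant for illustration rather than for large tables.

```r
# Toy data (hypothetical): ind2 relabels ind1, while ind3 merges groups 2 and 3
ind1 <- c(1, 2, 3, 1, 2, 3)
ind2 <- c("a", "b", "c", "a", "b", "c")
ind3 <- c("a", "b", "b", "a", "b", "b")

# TRUE when each level of x co-occurs with exactly one level of y, and vice versa
same_grouping_tab <- function(x, y) {
  tab <- table(x, y) > 0
  all(rowSums(tab) == 1) && all(colSums(tab) == 1)
}

same_grouping_tab(ind1, ind2)  # TRUE
same_grouping_tab(ind1, ind3)  # FALSE: "b" covers both ind1 = 2 and ind1 = 3
```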

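【Question comments】:

Tags: r data.table grouping indicator

【Solution 1】:

You can simply group by ind2 and count the distinct values of ind1, and vice versa. If any count is > 1, the two indicators do not group the data in the same way. Note that a full "if and only if" check needs both directions; the code below shows one of them (distinct ind1 per ind2). Here is one way using base R: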

any(with(dt, ave(ind1, ind2, FUN = function(x) length(unique(x)))) > 1)

[1] FALSE # means ind1 and ind2 group the data in the same way

Or, if it is easier to interpret, you can use all to check whether every count == 1:

all(with(dt, ave(ind1, ind2, FUN = function(x) length(unique(x)))) == 1)

[1] TRUE # means ind1 and ind2 group the data in the same way
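The "vice versa" part matters. As a minimal sketch (the helper name `same_grouping_base` and the toy vectors are assumptions for illustration), the check can be run in both directions. It uses tapply rather than ave, because ave writes the counts back into a vector of the input's type, which gets awkward when the counted column is character.

```r
same_grouping_base <- function(x, y) {
  # number of distinct x values within each y group, and the reverse
  n_x_per_y <- tapply(x, y, function(v) length(unique(v)))
  n_y_per_x <- tapply(y, x, function(v) length(unique(v)))
  all(n_x_per_y == 1) && all(n_y_per_x == 1)
}

x <- c(1, 1, 2, 2)
y <- c("a", "b", "c", "c")   # x = 1 is split into "a" and "b"

# One direction only: every y group contains a single x value, so this passes
one_way <- all(tapply(x, y, function(v) length(unique(v))) == 1)   # TRUE

# Both directions: the split of x = 1 is detected
both_ways <- same_grouping_base(x, y)                              # FALSE
```

Note that tapply drops NA groups by default, so missing values would need separate handling.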

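【Comments】:

【Solution 2】:

You can create a numeric group index from each of the two variables and check that the two indices are equal for all rows.

This adds the two group-index columns to the table and checks them for equality; you can delete the columns afterwards if you like.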

dt[,  g1 := .GRP, ind1][, g2 := .GRP, ind2][, all(g1 == g2)]
#[1] TRUE
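Because := modifies the table by reference, the helper columns g1 and g2 stay in dt after the check. A minimal sketch of two ways to avoid that, rebuilding the question's example data so the block is self-contained:

```r
library(data.table)

set.seed(1)
ind1 <- sample(1:3, 1000, replace = TRUE)
dt <- data.table(ind1 = ind1, ind2 = c("a", "b", "c")[ind1])

# Option 1: add the helper columns, test, then drop them again by reference
dt[, g1 := .GRP, by = ind1][, g2 := .GRP, by = ind2]
same <- dt[, all(g1 == g2)]
dt[, c("g1", "g2") := NULL]

# Option 2: run the check on a copy so dt itself is never modified
same_copy <- copy(dt)[, g1 := .GRP, by = ind1][, g2 := .GRP, by = ind2][, all(g1 == g2)]
```

Option 2 costs one full copy of the table, which may matter for very large data.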

Edit: Shree's idea of counting distinct values is better. See below for a data.table implementation.

Edit 2: see also the comments for other solutions.

dt[, uniqueN(ind2), ind1][, all(V1 == 1)]
#[1] TRUE
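The one-liner above counts distinct ind2 values within each ind1 group only. A small sketch on a hypothetical counterexample (dt_bad is not from the thread) shows why the reverse direction is needed as well for a strict "if and only if" check:

```r
library(data.table)

# Hypothetical counterexample: ind1 values 1 and 2 both map to "a"
dt_bad <- data.table(ind1 = c(1, 2, 3, 3), ind2 = c("a", "a", "c", "c"))

# Each ind1 group has a single ind2 value, so the one-direction check passes
one_way <- dt_bad[, uniqueN(ind2), by = ind1][, all(V1 == 1)]                 # TRUE

# Adding the reverse direction catches that "a" spans two ind1 groups
both_ways <- one_way && dt_bad[, uniqueN(ind1), by = ind2][, all(V1 == 1)]    # FALSE
```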

Benchmark on a table with 1e7 rows and 10 groups, using two equivalent columns:

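library(data.table)

set.seed(1)
ind1 <- sample(1:10, 1e7, replace = TRUE)
# letters[] rather than c("a","b","c")[]: indexing a 3-element vector
# with 4..10 would yield NA, and the columns would no longer be equivalent
ind2 <- letters[ind1]

dt <- data.table(ind1 = ind1, ind2 = ind2)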

microbenchmark::microbenchmark(
grp = dt[,  g1 := .GRP, ind1][, g2 := .GRP, ind2][, all(g1 == g2)], 
uniques = dt[, uniqueN(ind2), ind1][, all(V1 == 1)]
)

# Unit: milliseconds
#     expr      min       lq    mean   median       uq       max neval cld
#      grp 727.9489 838.2190 918.280 879.1036 971.3982 1542.9655   100   b
#  uniques 472.1311 502.1327 529.581 526.5357 540.5406  723.5078   100  a

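【Comments】:

Another one: unique(dt[, c("ind1", "ind2")])[, !(anyDuplicated(ind1) || anyDuplicated(ind2))]... and another variant of Shree's, I think: with(dt, max(tapply(ind1, ind2, function(x) length(unique(x))))) == 1L
FYI, if I scale the problem up, I find that uniques does very badly: chat.stackoverflow.com/transcript/message/46781006#46781006
For me, if I change 1e4 to 1e6 in the linked benchmark, the result reverses, so the "uniques" option does better than the "another" option. I would say it is worth posting this information and the other approach as a separate answer.
OK, yes, since my earlier comment I have been adjusting the n and ng parameters and have seen the comparison go every which way. Will post another answer, thanks.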
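unique(length(ind2)) is roughly twice as fast as uniqueN(ind2). (Presumably length(unique(ind2)) is meant; as written, unique(length(ind2)) only returns the group size.)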

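【Solution 3】:

My instinct was to use .GRP as in @IceCreamToucan's answer, but another approach is to deduplicate the two columns jointly and then check each column separately for duplicates: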

# data.table
unique(dt[, c("ind1", "ind2")])[, !(anyDuplicated(ind1) || anyDuplicated(ind2))]

# base, with df = data.frame(dt)
with(unique(df[, c("ind1", "ind2")]), !(anyDuplicated(ind1) || anyDuplicated(ind2)))
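The reasoning behind this check: after joint deduplication each distinct (ind1, ind2) pair appears once, so a repeated ind1 value means that value is paired with two different ind2 values, and likewise for a repeated ind2. A minimal sketch wrapping it in a function (the name `same_grouping_dt` and the toy tables are hypothetical, for illustration only):

```r
library(data.table)

# Hypothetical helper; a and b are column names given as strings
same_grouping_dt <- function(dt, a, b) {
  pairs <- unique(dt[, c(a, b), with = FALSE])   # one row per distinct pair
  !(anyDuplicated(pairs[[a]]) || anyDuplicated(pairs[[b]]))
}

good <- data.table(ind1 = c(1, 2, 3, 1), ind2 = c("a", "b", "c", "a"))
bad  <- data.table(ind1 = c(1, 2, 3, 3), ind2 = c("a", "a", "c", "c"))

same_grouping_dt(good, "ind1", "ind2")  # TRUE
same_grouping_dt(bad,  "ind1", "ind2")  # FALSE: "a" is paired with ind1 = 1 and 2
```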

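I tried various benchmarks and did not see a clear winner overall, but surprisingly, between the two options above, the timing almost always strongly favors data.table.

Example with the number of rows (n) and number of groups (ng) as parameters: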

library(data.table)
library(magrittr)

ng = 150
n = 1e6
set.seed(1)
ind1 <- sample(1:ng, n, replace=TRUE )
ind2 <- -ind1

dt <- data.table(ind1=ind1, ind2=ind2)
df = data.frame(dt)

microbenchmark::microbenchmark(times = 3L,
grp = dt[,  g1 := .GRP, ind1][, g2 := .GRP, ind2][, all(g1 == g2)], 
uniques = dt[, uniqueN(ind2), ind1][, all(V1 == 1)],
shreet = with(dt, max(tapply(ind1, ind2, function(x) length(unique(x))))) == 1L,
shreep = with(dt, tapply(ind1, ind2, . %>% unique %>% length)) %>% max %>% equals(1L),
another = unique(dt[, c("ind1", "ind2")])[, !(anyDuplicated(ind1) || anyDuplicated(ind2))],
banother = with(unique(df[, c("ind1", "ind2")]), !(anyDuplicated(ind1) || anyDuplicated(ind2)))
)

Results:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
      grp   31.89250   34.92348   46.06510   37.95446   53.15140   68.34833     3
  uniques   32.82520   34.36808   36.32377   35.91097   38.07306   40.23515     3
   shreet   38.26046   38.35256   44.37116   38.44467   47.42650   56.40834     3
   shreep   43.37336   98.56367  145.38600  153.75399  196.39231  239.03064     3
  another   14.47064   31.42879   88.20134   48.38694  125.06669  201.74643     3
 banother 1338.14070 1427.35481 1658.08404 1516.56893 1818.05572 2119.54251     3

【Comments】:

Whether I use microbenchmark or bench::mark, another is consistently faster for me. Using Win 7, R-3.5.1 x64, data.table 1.12.2, DTthreads=4, 16GB RAM.
I also get another as the fastest. If uniques uses unique(length(ind1)) (presumably length(unique(ind1)) is meant), it comes in second for me. I have 2 cores / 4 threads.