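My first instinct was to use .GRP, as in @IceCreamToucan's answer, but another approach is to drop duplicates across the two columns jointly and then check each column separately for duplicates: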
# data.table
unique(dt[, c("ind1", "ind2")])[, !(anyDuplicated(ind1) || anyDuplicated(ind2))]
# base, with df = data.frame(dt)
with(unique(df[, c("ind1", "ind2")]), !(anyDuplicated(ind1) || anyDuplicated(ind2)))
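To see why this works, here is a minimal sketch on toy data, assuming the goal is to test whether ind1 and ind2 define the same grouping (a one-to-one mapping). Once duplicate pairs are dropped, any value that still repeats within a column must be paired with two different values in the other column. The helper name is_one_to_one and the data frames ok and bad are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): wraps the base R
# variant above. Returns TRUE when ind1 and ind2 map one-to-one.
is_one_to_one <- function(df) {
  # Keep each distinct (ind1, ind2) pair once.
  pairs <- unique(df[, c("ind1", "ind2")])
  # A value repeated within either column now means it is paired with
  # more than one value in the other column.
  !(anyDuplicated(pairs$ind1) || anyDuplicated(pairs$ind2))
}

# Each ind1 value pairs with exactly one ind2 value, and vice versa.
ok <- data.frame(ind1 = c(1, 1, 2, 3), ind2 = c("a", "a", "b", "c"))

# ind1 == 1 pairs with both "a" and "b", so the groupings differ.
bad <- data.frame(ind1 = c(1, 1, 2, 3), ind2 = c("a", "b", "b", "c"))

res_ok  <- is_one_to_one(ok)   # TRUE
res_bad <- is_one_to_one(bad)  # FALSE
```

The data.table variant follows the same logic; only the deduplication step differs.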
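I tried various benchmarks and saw no clear winner overall. Surprisingly, though, the comparison between the two options above almost always comes out strongly in favor of data.table.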
An example, with the number of rows (n) and the number of groups (ng) as parameters:
library(data.table)
library(magrittr)
ng = 150
n = 1e6
set.seed(1)
ind1 <- sample(1:ng, n, replace=TRUE )
ind2 <- -ind1
dt <- data.table(ind1=ind1, ind2=ind2)
df = data.frame(dt)
microbenchmark::microbenchmark(times = 3L,
grp = dt[, g1 := .GRP, ind1][, g2 := .GRP, ind2][, all(g1 == g2)],
uniques = dt[, uniqueN(ind2), ind1][, all(V1 == 1)],
shreet = with(dt, max(tapply(ind1, ind2, function(x) length(unique(x))))) == 1L,
shreep = with(dt, tapply(ind1, ind2, . %>% unique %>% length)) %>% max %>% equals(1L),
another = unique(dt[, c("ind1", "ind2")])[, !(anyDuplicated(ind1) || anyDuplicated(ind2))],
banother = with(unique(df[, c("ind1", "ind2")]), !(anyDuplicated(ind1) || anyDuplicated(ind2)))
)
Results:
Unit: milliseconds
    expr        min         lq       mean     median         uq        max neval
     grp   31.89250   34.92348   46.06510   37.95446   53.15140   68.34833     3
 uniques   32.82520   34.36808   36.32377   35.91097   38.07306   40.23515     3
  shreet   38.26046   38.35256   44.37116   38.44467   47.42650   56.40834     3
  shreep   43.37336   98.56367  145.38600  153.75399  196.39231  239.03064     3
 another   14.47064   31.42879   88.20134   48.38694  125.06669  201.74643     3
banother 1338.14070 1427.35481 1658.08404 1516.56893 1818.05572 2119.54251     3