Get indices of identical rows from two different data.tables: answers

【Question title】: Get indices of identical rows from two different data.tables
【Posted】: 2020-04-02 13:46:26
【Question description】:

I am trying to do something similar to:

R - indices of matching values of two data.tables

Here is the original reproducible example from that question:

library(data.table)

S.disc <- c(2000,2000)
S.max  <- c(6200,2300)
S.min  <- c(700,100)

Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- unique(Actions[,list(k1,k2,i)])

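As an R beginner, I am struggling to extend this example to all columns.

In my case, the first data.table has 60 columns and 2.2 million rows. The second data.table is a subset of the first: it has the same 60 columns but far fewer rows, about 100,000.

In the end I want a vector whose length equals the number of rows of the first data.table (2.2 million), with TRUE where the same row also exists somewhere in the second data.table and FALSE otherwise.

I wrote a for loop, but it is very inefficient and takes hours to finish: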

library(data.table)
library(dplyr)  # provides sample_n()

S.disc <- c(2000,2000)
S.max  <- c(6200,2300)
S.min  <- c(700,100)

Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- as.data.table(sample_n(Actions, 10))


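# TRUE where the row of Actions also occurs somewhere in States
idx_filter <- rep(NA, nrow(Actions))

for (a in seq_len(nrow(Actions))) {
  for (b in seq_len(nrow(States))) {
    # a row matches when all of its columns are equal
    if (sum(Actions[a, ] == States[b, ]) == ncol(Actions)) idx_filter[a] <- TRUE
  }
}

idx_filter[is.na(idx_filter)] <- FALSE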

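How can I do the same thing efficiently with data.table?

【Question discussion】:

Did you mean b in 1:nrow(States) rather than b in 1:length(States)? With length you get 5 (the number of columns in States); with nrow you get 10 (the 10 rows of States).
Yes, you are right! I meant nrow(States), thank you very much!

Tags: r data.table query-optimization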

【Solution 1】:

With data.table >= 1.12.4 you can also use on=.NATURAL to join on the columns the two tables have in common (see item 10 under the data.table 1.12.4 release in https://cran.r-project.org/web/packages/data.table/news/news.html).

So another option is:

Actions[, idx_filter := FALSE][States, on=.NATURAL, idx_filter := TRUE]
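
To make the one-liner above easier to follow, here is a minimal sketch on two toy tables. The names big and small and their contents are made up for illustration and are not part of the original question; the sketch assumes data.table >= 1.12.4 so that on = .NATURAL is available.

```r
library(data.table)

# Hypothetical toy tables; "big" plays the role of Actions, "small" of States.
big   <- data.table(x = c(1, 1, 2, 2), y = c("a", "b", "a", "b"))
small <- data.table(x = c(2, 1), y = c("a", "b"))

# Step 1: add a flag column that is FALSE everywhere (by reference).
# Step 2: join on all columns the two tables share (x and y) and set
#         the flag to TRUE only for the rows of big that match a row of small.
big[, idx_filter := FALSE][small, on = .NATURAL, idx_filter := TRUE]

big$idx_filter
# expected: FALSE TRUE TRUE FALSE
```

Because the update happens by reference and no key is set, the rows of big stay in their original order, so the flag lines up with the original row positions.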

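【Discussion】:

【Solution 2】:

With data.table, you can first create a column match set to FALSE. Then join with States and set the matching rows to TRUE, and finally select the logical match column. Note that setkeyv will sort the Actions data.table, so the resulting vector follows the sorted row order rather than the original one.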

library(data.table)

setkeyv(setDT(Actions), names(Actions))
setkeyv(setDT(States), names(States))
Actions[ , match := FALSE][States, match := TRUE][ , match]
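
The effect of that sorting can be seen on a small example. The sketch below uses made-up tables (dt, dt2 and sub are hypothetical names) and contrasts the keyed join with an ad hoc join via on =, which leaves the row order untouched.

```r
library(data.table)

# Hypothetical toy table, deliberately not sorted.
dt  <- data.table(a = c(3, 1, 2), b = c("z", "x", "y"))
sub <- dt[1]                       # copy of the first row: (3, "z")

# Keyed variant: setkeyv() physically reorders dt to (1,x), (2,y), (3,z).
setkeyv(dt, names(dt))
setkeyv(sub, names(sub))
dt[, match := FALSE][sub, match := TRUE]
dt$match
# expected: FALSE FALSE TRUE  (the match moved from position 1 to position 3)

# Ad hoc variant: joining with on = keeps the original row order.
dt2 <- data.table(a = c(3, 1, 2), b = c("z", "x", "y"))
dt2[, match := FALSE][sub, on = c("a", "b"), match := TRUE]
dt2$match
# expected: TRUE FALSE FALSE  (the match stays at position 1)
```

Both variants flag the same number of rows, so comparing sums will agree even when the positions do not.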

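Edit: as @chinsoon12 pointed out, you can use on = .NATURAL and drop setkeyv (just use setDT). I added set.seed(123) to make the example reproducible. The result appears to be the same. Here is the complete code I am using: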

library(data.table)
library(dplyr)  # provides sample_n()

set.seed(123)

S.disc <- c(2000,2000)
S.max  <- c(6200,2300)
S.min  <- c(700,100)

Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- as.data.table(sample_n(Actions, 10))

# Reference result from the original nested loop
idx_filter <- rep(NA, nrow(Actions))

for (a in seq_len(nrow(Actions))) {
  for (b in seq_len(nrow(States))) {
    if (sum(Actions[a, ] == States[b, ]) == ncol(Actions)) idx_filter[a] <- TRUE
  }
}

idx_filter[is.na(idx_filter)] <- FALSE

#setkeyv(setDT(Actions), names(Actions))
#setkeyv(setDT(States), names(States))

setDT(Actions)
setDT(States)

result <- Actions[ , match := FALSE][States, on=.NATURAL, match := TRUE][ , match]
result

  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [28] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
 [55] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [82] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

identical(result, idx_filter) 
[1] TRUE
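
If integer row indices are wanted rather than a logical vector, one possible variation (not part of the original answer) is to ask the join for row numbers with which = TRUE. The sketch below uses made-up tables named actions and states and assumes every row of states occurs in actions, as in the question.

```r
library(data.table)

# Hypothetical toy tables standing in for Actions and States.
actions <- data.table(k = c(10, 20, 30, 40), i = c(1L, 2L, 3L, 4L))
states  <- actions[c(4, 2)]        # a subset of actions, in a different order

# Row numbers of actions matched by each row of states (in the order of states).
# which = TRUE returns the indices instead of the joined table.
idx <- actions[states, on = names(states), which = TRUE]
idx
# expected: 4 2

# Convert the indices to a logical vector aligned with the rows of actions.
flag <- seq_len(nrow(actions)) %in% idx
flag
# expected: FALSE TRUE FALSE TRUE
```

This variant does not add a column to actions, which may be convenient when the table should stay unchanged.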

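【Discussion】:

I just applied your approach, but when I compare the match column of the data.table Actions with idx_filter I do not get the same result: sum(idx_filter) == sum(Actions$match) -> FALSE
Hi Ben, I am facing another problem. Although sum(idx_filter) == sum(Actions$match) gives me TRUE, sum(idx_filter == Actions$match) does not. This matters because the position is important for my intended task. I need exactly the same output as idx_filter; is there any chance you could update your code? Thanks a lot!
I updated my code (replaced length(States) with nrow(States)). I am comparing all columns. The table States is just a subset of the table Actions, and I want a vector of length nrow(Actions) in which TRUE means that the observation at that position in Actions exists somewhere in States. Note that all rows in Actions are unique, and therefore so are those in States. So if idx_filter has TRUE at position x, I also need TRUE at position x in Actions$match.
Your updated version works perfectly, thank you very much!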