Finding combinations that do not exist in a given data set: answers

[Question title]: Finding combinations that do not exist in a given data set
[Posted]: 2019-09-02 14:20:00
[Question description]:

Suppose I have a binary data set and I want to find out which combinations have not occurred yet. For example:

X1 X2 X3
1  0  1
0  1  1

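As you can see, the combination X1=1, X2=1 and X3=0 does not occur. Order does not matter. Is there a package that can do this, or is there some other solution?

[Question comments]:

Tags: r combinations permutation

[Solution 1]: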

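Use setdiff as shown. Note that the row-wise setdiff for data frames is the one dplyr provides; base R's setdiff treats a data frame as a list of columns and does not compare rows.

library(dplyr)  # provides the row-wise setdiff() method for data frames

DF <- data.frame(X1 = 1:0, X2 = 0:1, X3 = c(1L, 1L)) # test input

# all 2^ncol(DF) combinations of 0 and 1
g <- do.call("expand.grid", rep(list(0:1), ncol(DF)))
names(g) <- names(DF)

# rows of g that do not occur in DF
setdiff(g, DF)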

This returns the six of the eight 0/1 combinations that do not occur in DF.
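
The same set difference can also be sketched in base R only, by comparing rows through a pasted key. This is a minimal sketch rather than the answerer's code; the helper name row_key and the variable missing_rows are made up for illustration.

```r
DF <- data.frame(X1 = 1:0, X2 = 0:1, X3 = c(1L, 1L))  # test input from the answer

# all 2^ncol(DF) combinations of 0 and 1
g <- do.call("expand.grid", rep(list(0:1), ncol(DF)))
names(g) <- names(DF)

# hypothetical helper: collapse each row into one string such as "1_0_1"
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "_"))

# keep the rows of g whose key never appears in DF
missing_rows <- g[!(row_key(g) %in% row_key(DF)), , drop = FALSE]
nrow(missing_rows)
```

Since g has 8 rows and DF contains 2 distinct rows, 6 rows remain in missing_rows.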

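If the intent is that every row of DF has the same number of 1s, we should only consider rows with that number of 1s, and can build them with combn (base R) like this:

nc <- ncol(DF)
k <- sum(DF[1, ])  # number of 1s in each row of DF

# one row per way of placing k ones in nc columns
g <- t(combn(nc, k, function(x) +(seq(nc) %in% x)))
g <- as.data.frame(g)

# now repeat the last two lines of the prior approach
# (setdiff here is the row-wise data frame version from dplyr):
names(g) <- names(DF)
setdiff(g, DF)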

giving:

X1 X2 X3 
 1  1  0
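
The combn variant only makes sense if every row really has the same number of 1s. A small sketch of how that assumption could be checked, and of how much smaller the candidate set becomes; the variable names here are illustrative and not part of the answer.

```r
DF <- data.frame(X1 = 1:0, X2 = 0:1, X3 = c(1L, 1L))  # test input from the answer

# number of 1s in each row
ones_per_row <- unname(rowSums(DF))

# TRUE when all rows share the same count, i.e. the combn approach applies
same_count <- length(unique(ones_per_row)) == 1

# candidate rows under that assumption versus all 0/1 rows
n_candidates <- choose(ncol(DF), ones_per_row[1])  # choose(3, 2)
n_all <- 2^ncol(DF)
```

For the example, same_count is TRUE, and there are 3 candidate rows instead of 8.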

[Comments]:

[Solution 2]:

Generating all possible binary permutations and then anti-joining them against your data seems to be the simplest approach.

library(gtools)
library(dplyr)

test <- data.frame(V1 = c(1,0), V2 = c(0,1), V3 = c(1,1))

all_perm <- data.frame(permutations(n = 2, r = 3, v = c(0,1), repeats.allowed = TRUE))
colnames(all_perm) <- colnames(test)

anti_join(all_perm, test)

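[Comments]:

I have 9 columns and 85 observations. Should I set n = 2 and r = 9?
Yes. n is the size of the source vector v, here v = c(0,1). r is the size of each target vector, here the number of variables, so 9.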
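
To make the 9-column case concrete, here is a hedged sketch on made-up data. It swaps in base R's expand.grid and a pasted row key instead of gtools and dplyr, so it is not the code from this answer. The data frame dat is randomly generated and purely hypothetical.

```r
set.seed(1)  # hypothetical data: 85 random rows over 9 binary columns
dat <- as.data.frame(matrix(rbinom(85 * 9, 1, 0.5), ncol = 9))

# all 2^9 rows of 0/1 values, with the same column names as dat
all_rows <- do.call("expand.grid", rep(list(0:1), ncol(dat)))
names(all_rows) <- names(dat)

# hypothetical helper: collapse each row into a string such as "010011010"
key <- function(d) do.call(paste, c(unname(as.list(d)), sep = ""))

# rows that never occur in dat
missing_rows <- all_rows[!(key(all_rows) %in% key(dat)), , drop = FALSE]

nrow(all_rows)                          # 2^9 = 512
nrow(missing_rows) + nrow(unique(dat))  # 512
```

With only 85 observations, at least 512 - 85 = 427 combinations are necessarily missing; duplicates in the data increase that number.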

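[Solution 3]:

An efficient solution that should scale well (at least better than approaches that create all permutations) is to work with the positions of the 1 values.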

#the data
m <- matrix(c(1, 0, 0, 1, 1, 1), 2)
#     [,1] [,2] [,3]
#[1,]    1    0    1
#[2,]    0    1    1

#number of 1 per row
n <- 2

#find positions of 1s
library(Matrix)
M <- Matrix(t(m), sparse = TRUE)
inds <- matrix(M@i + 1L, ncol = n, byrow = TRUE)  # one row of n positions per data row
#     [,1] [,2]
#[1,]    1    3
#[2,]    2    3


#all possible positions
combs <- combn(seq_len(ncol(m)), n, simplify = FALSE)
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 3

#missing combs
mis <- setdiff(combs, asplit(inds, 1))
mis
#[[1]]
#[1] 1 2

sparseMatrix(j = unlist(mis), 
             i = rep(seq_along(mis), each = n), 
             dims = c(length(mis), ncol(m)))
#1 x 3 sparse Matrix of class "ngCMatrix"
#
#[1,] | | .
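
The same position-based idea can be sketched without the Matrix package. This is an illustrative base R variant, not the answerer's code; the helper to_key and the variables pos, seen and mis_rows are hypothetical names.

```r
m <- matrix(c(1, 0, 0, 1, 1, 1), 2)  # the data from above
n <- 2                               # number of 1s per row

# positions of the 1s in each row, one integer vector per row
pos <- lapply(seq_len(nrow(m)), function(i) which(m[i, ] == 1))

# all possible sets of n positions
combs <- combn(seq_len(ncol(m)), n, simplify = FALSE)

# compare position sets as strings such as "1,3"
to_key <- function(p) paste(p, collapse = ",")
seen <- vapply(pos, to_key, character(1))
mis <- combs[!(vapply(combs, to_key, character(1)) %in% seen)]

# rebuild the missing rows as a dense 0/1 matrix, one row per missing set
mis_rows <- t(vapply(mis, function(p) as.integer(seq_len(ncol(m)) %in% p),
                     integer(ncol(m))))
```

For the example data, mis holds the single position set 1 2, and mis_rows is the one-row matrix 1 1 0. The dense result is only practical when the number of missing rows is small; the sparse matrix above is the better fit otherwise.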

[Comments]: