Select a subset of combinations (answers)

【Question title】: Select a subset of combinations
【Posted】: 2013-08-17 18:40:17
【Question】:

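Suppose I have a 20 x 5 matrix, and I want to select subsets of its rows and run some calculations on them. Suppose further that each submatrix is 7 x 5. I can of course do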

ncomb <- combn(20, 7)

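which gives me all possible combinations of 7 row indices, which I can then use to extract the submatrices. But even for a small 20 x 5 matrix there are already 77520 possible combinations. So I would like to randomly sample some of them, say 5000.

One possibility is the following: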

ncomb <- combn(20, 7)
ncombsub <- ncomb[, sample(77520, 5000)]

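That is, I generate all possible combinations and then randomly select 5000 of them. But I suspect that computing all possible combinations becomes a problem with a larger matrix, say 100 x 7.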
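To put rough numbers on that concern, here is a minimal sketch that only counts the combinations instead of generating them. The variable names are illustrative, and the memory figure assumes 4 bytes per stored integer index.

```r
# Number of 7-row combinations for the two matrix sizes in the question
n_small <- choose(20, 7)    # 77520
n_large <- choose(100, 7)   # 16007560800

# The larger count no longer fits into an R integer
n_large > .Machine$integer.max   # TRUE

# Approximate size of the full 7 x choose(100, 7) index matrix, in GB,
# assuming 4 bytes per integer entry
7 * n_large * 4 / 1e9
```

So for 100 rows, materialising every combination with combn() is not realistic, which is why sampling has to happen without the full enumeration.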

So I would like to know whether there is a way to get a subset of the combinations without first generating all of them.

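【Discussion】:

Yes, I think this could be done by modifying combn or by writing your own function (which is probably easier). Coming up with an algorithm for this and implementing it should not be hard.
You might want to look at the related post here
@Roland I ended up modifying combn() as you suggested. It works well.

Tags: r combinations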

【Solution 1】:

Your approach:

op <- function(){
    ncomb <- combn(20, 7)
    ncombsub <- ncomb[, sample(choose(20,7), 5000)]
    return(ncombsub)
}

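A different strategy is to simply sample seven rows from the original matrix 5000 times, replacing any duplicate samples with fresh ones until there are 5000 unique row combinations: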

me <- function(){
  rowsample <- replicate(5000,sort(sample(1:20,7,FALSE)),simplify=FALSE)
  while(length(unique(rowsample))<5000){
     rowsample <- unique(rowsample)
     rowsample <- c(rowsample,
                    replicate(5000-length(rowsample),
                              sort(sample(1:20,7,FALSE)),simplify=FALSE))
  }
  return(do.call(cbind,rowsample))
}

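This should be more efficient because it avoids having to compute all combinations first, which gets expensive as the matrix grows.

However, some benchmarking suggests that this is not the case, at least for this matrix: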

library(microbenchmark)
microbenchmark(op(),me())

Unit: milliseconds
 expr      min       lq   median      uq      max neval
 op() 184.5998 201.9861 206.3408 241.430 299.9245   100
 me() 411.7213 422.9740 429.4767 474.047 490.3177   100

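【Discussion】:

A couple of issues. For your code to work, I think you also need to sort each column, i.e. each sample of indices, before the while loop. Otherwise unique() will not work. The second issue, I think, is that the MARGIN argument of unique() needs to be set to 2 (the default is 1). Also, it should be ncol(unique(rowsample)) rather than length(unique(rowsample)), because length gives the total number of elements in a matrix, not the number of columns (in my case each column is one sample, so 5000 columns are 5000 index samples).
@Alex I made some changes (taking into account that replicate returns a list here, not a matrix). It turns out to be less efficient than your original solution. And if you let replicate simplify to a matrix, it is even slower.
I ended up modifying the original combn() function and byte-compiling it. It works fine. Thanks for this solution anyway; I think your strategy will be useful for some other things I am working on.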
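The first comment's points about a matrix of samples (one sample per column) can be checked on a tiny example. This is only an illustrative sketch; the matrix `samples` is made up and is not data from the thread.

```r
# Four index samples stored as columns; columns 1 and 3 are identical,
# and column 2 holds the same rows in a different order
samples <- cbind(c(1, 2, 3), c(3, 1, 2), c(1, 2, 3), c(4, 5, 6))

length(samples)                    # 12: counts elements, not samples
ncol(unique(samples, MARGIN = 2))  # 3: only the exact duplicate column is dropped

# Sorting each column first makes reordered samples compare as equal
sorted <- apply(samples, 2, sort)
ncol(unique(sorted, MARGIN = 2))   # 2
```

With the list-based version in the answer, each sample is already sorted and length(unique(rowsample)) counts list elements, so the same checks are handled differently there.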

【Solution 2】:

I ended up doing what @Roland suggested: I modified combn() and byte-compiled the code:

combn_sub <- function (x, m, nset = 5000, seed=123, simplify = TRUE, ...) {
    stopifnot(length(m) == 1L)
    if (m < 0) 
        stop("m < 0", domain = NA)
    if (is.numeric(x) && length(x) == 1L && x > 0 && trunc(x) == 
        x) 
        x <- seq_len(x)
    n <- length(x)
    if (n < m) 
        stop("n < m", domain = NA)
    m <- as.integer(m)
    e <- 0
    h <- m
    a <- seq_len(m)
    len.r <- length(r <-  x[a] )
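    # Keep the count as a double: choose(n, m) can exceed .Machine$integer.max,
    # in which case as.integer() would return NA and the comparison below would fail
    count <- round(choose(n, m))
    if( count < nset ) nset <- count
    dim.use <- c(m, nset)       

    ##-----MOD 1: Change the output matrix size--------------
    out <- matrix(r, nrow = len.r, ncol = nset) 

    if (m > 0) {
        i <- 2L
        nmmp1 <- n - m + 1L

        ##----MOD 2: Select a subset of indices
        ## (sample.int avoids building 2:count; note that set.seed resets the global RNG)
        set.seed(seed)
        samp <- sort(c(1, 1 + sample.int( count - 1, nset - 1 )))  

        ##----MOD 3: Start a counter (a double, so it cannot overflow for large counts).
        ## The loop still walks every combination up to the largest sampled index.
        counter <- 2    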

        while (a[1L] != nmmp1 ) {
            if (e < n - h) {
                h <- 1L
                e <- a[m]
                j <- 1L
            }
            else {
                e <- a[m - h]
                h <- h + 1L
                j <- 1L:h
            }
            a[m - h + j] <- e + j

            #-----MOD 4: Whenever the counter matches an index in samp, 
            #a combination of row indices is produced and stored in the matrix `out`
            if(samp[i] == counter){ 
                out[, i] <- x[a]
                if( i == nset ) break
                i <- i + 1L
            }
            #-----Increase the counter by 1 for each iteration of the while-loop
            counter <- counter + 1L
        }
    }
    array(out, dim.use)
}

library("compiler")
combn_sub <- cmpfun(combn_sub)
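A different technique from the modified combn() above, and not something proposed in the thread, is to sample ranks and convert each rank directly into its combination, so that no combinations are walked through at all. The sketch below assumes lexicographic ordering (the order combn() uses for seq_len(n)); `unrank_comb` and `sample_combs` are hypothetical helper names, and ranks are only exact while choose(n, m) stays below 2^53.

```r
# Hypothetical helper: return the k-th (1-based) m-combination of 1..n
# in lexicographic order, without enumerating the earlier ones.
unrank_comb <- function(k, n, m) {
  out <- integer(m)
  x <- 1L      # smallest value still allowed at the current position
  k <- k - 1   # remaining rank, 0-based
  for (i in seq_len(m)) {
    repeat {
      # Number of combinations that have x at position i
      cnt <- choose(n - x, m - i)
      if (k < cnt) break
      k <- k - cnt   # skip all combinations with x at position i
      x <- x + 1L
    }
    out[i] <- x
    x <- x + 1L
  }
  out
}

# Hypothetical helper: draw nset distinct combinations, one per column
sample_combs <- function(n, m, nset) {
  total <- choose(n, m)
  stopifnot(nset <= total)
  ranks <- sort(sample.int(total, nset))  # distinct ranks, so distinct combinations
  vapply(ranks, unrank_comb, integer(m), n = n, m = m)
}

set.seed(123)
idx <- sample_combs(100, 7, 5000)
dim(idx)   # 7 5000
```

Each column of `idx` is one set of row indices, so `mat[idx[, j], ]` would give the j-th submatrix of a 100-row matrix `mat`. Because the ranks are sampled without replacement, no duplicate check is needed afterwards.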

【Discussion】: