How to calculate normalized ratios in all possible combinations efficiently for a large matrix in R? Answers

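【Question title】: How to calculate normalized ratios in all possible combinations efficiently for a large matrix in R?
【Posted】: 2020-07-16 10:17:56
【Question description】:

I want to efficiently calculate normalized ratios in all possible combinations for a large matrix in R. I asked a similar question earlier here, and with a small dataset the solutions provided there worked well. However, when I apply the same solutions to a large dataset (400 x 2151), my system hangs. My system has 16 GB of RAM and an Intel i7 processor. Here is the code with data: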

df <- matrix(rexp(860400), nrow = 400, ncol = 2151)

The solution provided by @Ronak Shah:

cols <- 1:ncol(df)
temp <- expand.grid(cols, cols)
new_data <- (df[,temp[,2]] - df[,temp[,1]])/(df[,temp[,2]] + df[,temp[,1]])

Or the following solution provided by @akrun:

f1 <- function(i, j) (df[, i] - df[, j])/(df[, i] + df[, j])
out <- outer(seq_along(df), seq_along(df), FUN = f1)
colnames(out) <- outer(names(df), names(df), paste, sep = "_")

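Both solutions take a very long time and the system hangs. So how can I do this efficiently?

【Comments on the question】:

The problem is that expand.grid and outer generate huge objects for larger vectors. As a result, a lot of time is spent on memory management (and you can easily run out of memory). In this case, the most efficient and simplest solution is a simple C++ double for loop written with Rcpp.
@Roland I am not very familiar with Rcpp, could you help me?
Well, after rereading your question: do you really need to store the results for all possible combinations? Did you calculate how many combinations there are?
I think it will be 2151*2151 = 4626801 combinations. Yes, I need to store the results of all possible combinations for further calculations.
Depending on the further calculations, you might not need to store them. Anyway, yes, use Rcpp.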
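
The memory concern raised in these comments can be checked with a quick back-of-the-envelope estimate. The sketch below assumes the full result is stored as a dense double matrix with 400 rows and one column per ordered pair of columns (8 bytes per value); the variable names are illustrative only.

```r
# Rough size of the full result: 400 rows x (2151 * 2151) ordered column pairs
n_row    <- 400
n_col    <- 2151
n_pairs  <- n_col^2            # 4626801 pairs, including i == j
n_values <- n_row * n_pairs    # number of doubles in the result
bytes    <- n_values * 8       # a double takes 8 bytes
gib      <- bytes / 2^30

cat(sprintf("%.0f values, about %.2f GiB\n", n_values, gib))
```

That is roughly 13.8 GiB for the result alone, before counting the temporaries of the same size that the expand.grid approach creates for df[, temp[, 2]] and df[, temp[, 1]], which explains why a 16 GB machine hangs.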

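Tags: r combinations tidyverse

【Solution 1】:

Since memory seems to be your main issue, how about using iterators? With the package RcppAlgos*, we can use permuteIter to calculate your ratios N at a time.

If you must have names, we need an additional iterator. This means you have to keep 2 iterators in sync, which can get tedious. Fortunately, with the summary() method of permuteIter, we can always see what the current index is, and there are several options for resetting it (e.g. random access with [[, front(), and back(), among others).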

library(RcppAlgos)
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)

ratioIter <- permuteIter(ncol(df), 2, FUN = function(x) {
    (df[, x[2]] - df[, x[1]]) / (df[, x[2]] + df[, x[1]])
})

## if you really want to name your output, you must have
## an additional name iterator... not very elegant
nameIter <- permuteIter(paste0("col", 1:ncol(df)), 2, FUN = function(x) {
    paste0(rev(x), collapse = "_")
})

firstIter <- matrix(ratioIter$nextIter(), ncol = 1)
firstName <- nameIter$nextIter()
colnames(firstIter) <- firstName

head(firstIter)
      col2_col1
[1,]  0.2990054
[2,] -0.9808111
[3,] -0.9041054
[4,]  0.7970873
[5,]  0.8625776
[6,]  0.2768359

## nextNIter() returns a list, so we bind the columns with do.call(cbind, ...)
next5Iter <- do.call(cbind, ratioIter$nextNIter(5))
next5Names <- unlist(nameIter$nextNIter(5))
colnames(next5Iter) <- next5Names

head(next5Iter)
       col3_col1  col4_col1   col5_col1  col6_col1  col7_col1
[1,] -0.28099710  0.1665687  0.40565958 -0.7524038 -0.7132844
[2,] -0.81434900 -0.4283759 -0.89811556 -0.8462906 -0.5399741
[3,] -0.02289368  0.4285012  0.05087853 -0.5091659 -0.2328995
[4,] -0.06825458  0.3126928  0.68968843 -0.2180618  0.6651785
[5,]  0.33508319  0.7389108  0.84733425  0.9065263  0.8977107
[6,]  0.61773589  0.3443120  0.61084584  0.5727938  0.3888807

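You should note that this does not produce the results where i == j (these are trivially 0, or NaN where the value itself is 0). So the total is just under 2151² (in fact exactly 2151^2 - 2151 = 4624650).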

ratioIter$summary()
$description
[1] "Permutations of 2151 choose 2"

$currentIndex
[1] 6

$totalResults
[1] 4624650

$totalRemaining
[1] 4624644
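
As an alternative to keeping a second name iterator in sync, the index reported by summary() could be mapped back to a column pair directly. The sketch below is plain base R and is not part of RcppAlgos; the helper names index_to_pair and pair_name are hypothetical, and it assumes the permutations are generated in lexicographic order, which is what the outputs shown in this answer suggest (index 1 is col2_col1, the last index is col2150_col2151).

```r
# Map a 1-based permutation index k to the ordered pair (first, second),
# assuming lexicographic order: (1,2), (1,3), ..., (1,n), (2,1), (2,3), ...
index_to_pair <- function(k, n) {
    i <- (k - 1) %/% (n - 1) + 1   # first element of the pair
    r <- (k - 1) %% (n - 1) + 1    # rank of the second element, skipping i
    j <- r + (r >= i)              # shift past i to get the second element
    cbind(first = i, second = j)
}

# Build names in the same "second_first" style used above via rev(x)
pair_name <- function(k, n) {
    p <- index_to_pair(k, n)
    paste0("col", p[, "second"], "_col", p[, "first"])
}

pair_name(c(1, 6, 4624650), 2151)
# [1] "col2_col1"       "col7_col1"       "col2150_col2151"
```

With this, the names for any batch can be recovered from the range of indices just consumed, so only ratioIter has to be advanced.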

There is even random access, as well as iterating backwards:

## Get the last ratio
lastIter <- ratioIter$back()
lastName <- nameIter$back()
mLast <- matrix(lastIter, ncol = 1)
colnames(mLast) <- lastName

head(mLast)
     col2150_col2151
[1,]      -0.6131926
[2,]       0.9936783
[3,]       0.1373538
[4,]       0.1014347
[5,]      -0.5061608
[6,]       0.5773503

## iterate backwards with the previous methods
prev5Iter <- do.call(cbind, ratioIter$prevNIter(5))
prev5Names <- unlist(nameIter$prevNIter(5))
colnames(prev5Iter) <- prev5Names

head(prev5Iter)
     col2149_col2151 col2148_col2151 col2147_col2151 col2146_col2151 col2145_col2151
[1,]     -0.75500069     -0.72757136     -0.94457988     -0.82858884     -0.25398782
[2,]      0.99696694      0.99674084      0.99778638      0.99826472      0.95738947
[3,]      0.27701596      0.45696010      0.00682574      0.01529448     -0.62368764
[4,]     -0.09508689     -0.90698165     -0.38221934     -0.41405984      0.01371556
[5,]     -0.31580709     -0.06561386     -0.07435058     -0.08033145     -0.90692881
[6,]      0.82697720      0.86858595      0.81707206      0.75627297      0.46272349

## Get a random sample
set.seed(123)
randomIter <- do.call(cbind, ratioIter[[sample(4624650, 5)]])

## We must reset the seed in order to get the same output for the names
set.seed(123)
randomNames <- unlist(nameIter[[sample(4624650, 5)]])
colnames(randomIter) <- randomNames

head(randomIter)
     col1044_col939 col20_col1552 col412_col2014 col1751_col1521 col337_col1295
[1,]     -0.3902066     0.4482747   -0.108018200      -0.1662857     -0.3822436
[2,]     -0.2358101     0.9266657   -0.657135882       0.0671608     -0.6821823
[3,]     -0.7054217     0.8944720    0.092363665       0.2667708      0.1908249
[4,]     -0.1574657     0.2775225   -0.221737223       0.3381454     -0.5705021
[5,]     -0.4282909    -0.4406433    0.092783086      -0.7506674     -0.1276932
[6,]      0.9998189    -0.2497586   -0.009375891       0.7071864     -0.2425258

Finally, it is written in C++, so it is very fast:

system.time(ratioIter$nextNIter(1e3))
#  user  system elapsed 
#     0       0       0

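* I am the author of RcppAlgos

【Comments】:

Thank you for your answer. But I cannot understand what I should change to get the output, and how to name the output columns properly so that I can trace back the combinations?
@BappaDas I showed an example of obtaining the output via do.call(cbind, ...). The names are easily obtained with similar logic. I have updated my answer, but this is not my preferred way. Getting names is nice, but it really muddies the approach.
How do I get the output for the whole data? Your answer only shows 5 iterations.
Yes, that is the point of iterators. You generate a few at a time and do what you need with them. As already explained, outer and expand.grid are very efficient in terms of speed, but producing objects as large as those in your example suggests that your approach needs to change. Your original approach will require a huge amount of memory. That is why I suggest using iterators. If you really need all of them, you can always call nextRemaining(). I strongly recommend reading the documentation and rethinking your approach. Hope this helps.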
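
The "generate a few at a time and do what you need with them" idea can also be sketched in plain base R, without any extra package. The example below is an assumption-laden illustration, not the answerer's method: it uses a small stand-in matrix, processes one reference column per step, and keeps only a hypothetical per-pair summary (the mean ratio) instead of the full 400 x 2151² result.

```r
set.seed(1)
df <- matrix(rexp(400 * 50), nrow = 400, ncol = 50)  # small stand-in for the 400 x 2151 data
n  <- ncol(df)

# pair_mean[i, j] will hold mean((df[, i] - df[, j]) / (df[, i] + df[, j]))
pair_mean <- matrix(NA_real_, nrow = n, ncol = n)

for (j in seq_len(n)) {
    # df[, j] is recycled down every column, so this one 400 x n block
    # contains the ratios of all columns i against column j
    block <- (df - df[, j]) / (df + df[, j])
    pair_mean[, j] <- colMeans(block)   # reduce the block, then discard it
}
```

Only one 400 x n block is alive at any time, so peak memory stays close to the size of the input. Whether such a reduction is possible depends on the "further calculations" mentioned in the question, which are not specified.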