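Efficient (memory-wise) function for repeated distance matrix calculations and chunking of extra large distance matrices (answers)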

【Question title】: Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
【Posted】: 2023-12-15 03:57:02
【Question description】:

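I was wondering if anyone could look at the following code and minimal example and suggest improvements, in particular regarding the efficiency of the code when working with really large data sets.

The function takes a data.frame, splits it by a grouping variable (a factor), and then calculates the distance matrix for all the rows in each group.

I do not need to keep the distance matrices. I only need some statistics from them (the mean, the histogram, and so on), after which they can be discarded.

I don't know much about memory allocation and the like, and I am wondering what the best way to do this would be, since I will be working with 10,000 to 100,000 cases per group. Any thoughts will be greatly appreciated!

Also, if I run into serious memory issues, what would be the least painful way of including bigmemory or some other large-data-handling package in the function?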

FactorDistances <- function(df) {
  # df is the data frame where the first column is the grouping variable. 
  # find names and number of groups in df (in the example there are three: 2, 3, 4)
  factor.names <- unique(df[1])
  n.factors <-length(unique(df$factor))
  # split df by factor into list - each subset dataframe is one list element
  df.l<-list()
  for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]}
  # use lapply to go through list and calculate distance matrix for each group
  # this results in a new list where each element is a distance matrix
  distances <- lapply (df.l, function(x) dist(x[,2:length(x)], method="minkowski", p=2))  
  # again use lapply to get the mean distance for each group
  means <- lapply (distances,  mean)  
  rm(distances)
  gc()
  return(means)
}

df <- data.frame(cbind(factor=rep(2:4,2:4), rnorm(9), rnorm(9)))
FactorDistances(df)
# The result is three average Euclidean distances, one per group, taken over all pairs in that group
# If a group has only one member, the value is NaN

EDIT: I edited the title to reflect the chunking issue that I posted as an answer.

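【Question comments】:

After looking at the code, I suspect that it may not do what you want it to do. However, the lack of any comments in the code makes it impossible to tell what you think each line is supposed to construct.
Sorry, I have now added comments (and cleaned up some clutter). I hope it is clearer now!

Tags: r memory-management matrix distance chunking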

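【Solution 1】:

I have come up with a chunking solution for those extra large matrices that dist() can't handle, and I am posting it here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is somewhat irrelevant, since it should only be used when dist() throws an error, usually one of the following: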

"Error in double(N * (N - 1)/2) : vector size specified is too large" 
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"

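The function (FunDistanceMatrixChunking) calculates the mean distance for the matrix, but you can change that to any other statistic. If you actually want to save the matrix, I believe some sort of file-backed bigmemory matrix is in order. Kudos to link for the idea and to Ari for his help!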
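As background to the error messages quoted above, the following is a minimal sketch of how much memory a single dist() result needs. It assumes the result is stored as one double per pair of rows, which is what the `double(N * (N - 1)/2)` call in the first error message suggests. The helper name `dist_memory_gib` is hypothetical and not part of any package.

```r
# Hypothetical helper (not part of base R): approximate size, in GiB, of the
# vector of pairwise distances that dist() has to allocate for n rows.
dist_memory_gib <- function(n) {
  n <- as.numeric(n)           # avoid integer overflow for large n
  n_pairs <- n * (n - 1) / 2   # one distance per unordered pair of rows
  n_pairs * 8 / 2^30           # 8 bytes per double, converted to GiB
}

dist_memory_gib(40000)    # roughly 6 GiB
dist_memory_gib(100000)   # roughly 37 GiB
```

Under this assumption, a group of 40,000 rows already needs about 6 GiB for the distances alone, which is consistent with the "cannot allocate vector of size 6.0 Gb" message.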

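FunDistanceMatrixChunking <- function (df, blockSize=100){
  n <- nrow(df)
  # number of blocks, rounded up so that a partial last block is included
  blocks <- n %/% blockSize
  if((n %% blockSize) > 0)blocks <- blocks + 1
  # one row per chunk: column 1 = number of distances, column 2 = their mean
  chunk.means <- matrix(NA, nrow=blocks*(blocks+1)/2, ncol= 2)
  dex <- 1:blockSize
  chunk <- 0
  for(i in 1:blocks){    
    p <- dex + (i-1)*blockSize
    # lex indexes the rows of block i inside the stacked c(q,p) distance matrix
    lex <- (blockSize+1):(2*blockSize)
    lex <- lex[p<= n]
    p <- p[p<= n]
    for(j in 1:blocks){
      q <- dex +(j-1)*blockSize
      q <- q[q<=n]     
      if (i == j) {       
        # diagonal chunk: distances within block i
        chunk <- chunk+1
        # drop=FALSE keeps a single-row block as a matrix
        x <- dist(df[p, , drop=FALSE])
        chunk.means[chunk,] <- c(length(x), mean(x))}
      if ( i > j) {
        # off-diagonal chunk: block i (rows) against block j (columns)
        chunk <- chunk+1
        x <- as.matrix(dist(df[c(q,p), , drop=FALSE]))[lex,dex] 
        chunk.means[chunk,] <- c(length(x), mean(x))}
    }
  }
  # na.rm=TRUE drops a chunk whose mean is NaN (an empty diagonal chunk,
  # which occurs when the last block holds a single row)
  return(weighted.mean(chunk.means[,2], chunk.means[,1], na.rm=TRUE))
}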
df <- cbind(var1=rnorm(1000), var2=rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize=100)
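The off-diagonal step in the function above computes a full dist() over two stacked blocks and keeps only one quadrant of it. For the Euclidean case only, one possible alternative is to compute the cross-block distances directly from the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. This is a sketch of a different technique, not the method used in the answer, and the helper name `cross_dist` is hypothetical.

```r
# Hypothetical helper: Euclidean distances between every row of a and every
# row of b, using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
cross_dist <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  d2 <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  sqrt(pmax(d2, 0))  # clamp tiny negative values caused by rounding
}

# Compare against the quadrant extracted from a stacked dist() call
set.seed(1)
a <- matrix(rnorm(20), ncol = 2)   # 10 rows
b <- matrix(rnorm(30), ncol = 2)   # 15 rows
reference <- as.matrix(dist(rbind(a, b)))[1:10, 11:25]
all.equal(cross_dist(a, b), reference, check.attributes = FALSE)
```

The expanded formula can lose some precision for points that are very close together, so it is worth checking against dist() on a small sample before relying on it.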

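I am not sure whether I should have posted this as an edit rather than an answer. It does solve my problem, although I didn't really specify the problem this way.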

【Comments】:

Posting it as an answer was the right choice. Thanks for bringing your solution back to the community.

【Solution 2】:

A few thoughts:

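unique(df[1]) probably works (by ignoring the data.frame attribute of the list), but it makes me nervous and is hard to read. unique(df[,1]) would be better.
for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]} can be done with split.
If you are worried about memory, definitely don't store the entire distance matrix for every level and only then calculate the summary statistics for each factor level! Change your lapply to: lapply (df.l, function(x) mean(dist(x[,2:length(x)], method="minkowski", p=2))).

If you need more than one summary statistic, calculate them together and return a list: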

lapply (df.l, function(x) {
   dmat <- dist(x[,2:length(x)], method="minkowski", p=2)
   list( mean=mean(dmat), median=median(dmat) )
})

See if that gets you anywhere. If not, you may need to go more specialized (avoid lapply, store your data.frames as matrices, and so on).
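Putting the suggestions above together, a minimal sketch of the question's function might look as follows. It assumes, as in the question, that the first column is the grouping variable and all remaining columns are numeric. The function name `FactorDistanceSummaries` and the `n.pairs` element are illustrative additions, not part of the original code.

```r
# Sketch: split by the first column, then summarise each group's distances.
# Each dist object only lives inside the anonymous function, so it can be
# garbage-collected once its summaries have been computed.
FactorDistanceSummaries <- function(df) {   # hypothetical name
  groups <- split(df[-1], df[[1]])          # one data.frame per group
  lapply(groups, function(x) {
    d <- dist(x, method = "minkowski", p = 2)
    list(n.pairs = length(d), mean = mean(d), median = median(d))
  })
}

set.seed(42)
df <- data.frame(factor = rep(2:4, 2:4), x = rnorm(9), y = rnorm(9))
str(FactorDistanceSummaries(df))
```

The result is a list named by group, so a single statistic can be pulled out with, for example, sapply(result, function(s) s$mean).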

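【Comments】:

(1.) split, of course! (2.) Your last piece of code is exactly what I was looking for, since I am calculating more than just the mean but didn't know how to get the results out! (3.) While the code now works fine for reasonable sizes, it does end with "cannot allocate vector of size 2.9 Gb", so I need to find another solution. What do you mean by "avoid lapply"? I don't think the data.frames are the problem memory-wise, just the distance matrices.
You are probably fine with lapply; it depends on whether garbage collection runs between rounds (it probably does... R is pretty good about that). So don't worry about it. For your 2.9 Gb problem, I would figure out which group triggers the error and run the distance matrix calculation on that group by itself. If that still produces the error, then you know where to focus. Or just get an Amazon EC2 cluster with a lot of memory and run it there. 2.9 Gb is not that big.
Thanks! The error is triggered by a 40K by 40K distance matrix, but besides the groups above I also intend to do all pairs (200K), hopefully without resorting to sampling. Before I try EC2, I will try doing it in chunks. I need the means and also the histograms, but if I make sure the breaks are the same, I can "sum up" the histograms. I found this link, which looks helpful, although I am not sure whether I can get a histogram out of a filebacked.big.matrix, so I will start with the block option.
Doing it in chunks is a good idea. Even if you could compute it all into one big file-backed matrix, that would not necessarily be better... disks are slow.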
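The idea of summing histograms over chunks, mentioned in the comments above, can be sketched as follows. This is an illustration under assumptions rather than code from the thread: the function name `ChunkedDistanceHistogram` is hypothetical, the distance is the dist() default (Euclidean), and `breaks` is assumed to span every distance that can occur, because hist() stops with an error otherwise.

```r
# Sketch: accumulate histogram counts chunk by chunk using common breaks.
ChunkedDistanceHistogram <- function(m, breaks, blockSize = 100) {
  m <- as.matrix(m)
  n <- nrow(m)
  starts <- seq(1, n, by = blockSize)       # first row of each block
  counts <- numeric(length(breaks) - 1)     # running total per bin
  for (i in seq_along(starts)) {
    p <- starts[i]:min(starts[i] + blockSize - 1, n)
    for (j in seq_len(i)) {
      q <- starts[j]:min(starts[j] + blockSize - 1, n)
      if (i == j) {
        # distances within block i
        x <- as.vector(dist(m[p, , drop = FALSE]))
      } else {
        # distances between block i (rows) and block j (columns)
        full <- as.matrix(dist(m[c(q, p), , drop = FALSE]))
        x <- as.vector(full[length(q) + seq_along(p), seq_along(q)])
      }
      if (length(x) > 0)
        counts <- counts + hist(x, breaks = breaks, plot = FALSE)$counts
    }
  }
  counts
}

set.seed(1)
m <- cbind(var1 = rnorm(250), var2 = rnorm(250))
breaks <- seq(0, 12, by = 0.5)   # assumed wide enough for this example
chunked <- ChunkedDistanceHistogram(m, breaks, blockSize = 100)
direct <- hist(as.vector(dist(m)), breaks = breaks, plot = FALSE)$counts
all(chunked == direct)           # the two sets of counts should agree
```

Because every chunk uses the same `breaks`, the per-chunk counts add up to the counts of the full distance matrix, and only one chunk of distances is held in memory at a time.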