列表列表中元素替换的慢循环答案

【问题标题】：Slow loop for element replacement in list of lists列表列表中元素替换的慢循环
【发布时间】：2015-09-10 04:18:56
【问题描述】：

我编写了一个循环来有条件地替换列表列表中的元素。虽然对 R 来说还是比较新的，但我确信我不会尽可能高效地处理这个问题。以下循环在我的实际数据上运行非常缓慢（一个小时左右）。我在下面包含了一个最小的工作示例，它完全复制了我的数据结构。

A <- matrix(c(0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0), nrow = 2, ncol = 6, byrow = TRUE)
B <- matrix(c(1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 0), nrow = 2, ncol = 6, byrow = TRUE)
C <- matrix(c(1, 0, 0, 1, 0, 1), nrow = 1, ncol = 6, byrow = TRUE)
D <- matrix(c(0, 0, 0, 1, 1, 1), nrow = 1, ncol = 6, byrow = TRUE)
mList <-list(list(A, B))
dList <- list(list(C, D))

如果dList 的第n 项的元素j 等于0，则循环的目标是用0 替换mList2 的第n 项的第j 列中的所有单元格。

mList
# [[1]]
# [[1]][[1]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    1    1    2    0    0
# [2,]    1    0    1    2    0    0
#
# [[1]][[2]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    0    1
# [2,]    1    0    1    2    0    0

dList
# [[1]]
# [[1]][[1]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    0    0    1    0    1
#
# [[1]][[2]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    1    1    1

这是使用apply 函数集的另一个未实现收益的实例吗？有没有更好的不涉及使用四个索引的方法？

for(i in 1:length(dList)) {
    for(j in 1:length(dList[[i]])) {
        for(k in 1:length(dList[[i]][[j]])) {
            for(m in 1:nrow(mList[[i]][[j]])) {
                mList[[i]][[j]][m, k] <- 
                    ifelse(
                        dList[[i]][[j]][k] == 1, 
                        mList[[i]][[j]][m, k], 
                        0
                    )
            }
        }
    }
}

导致：

mList
# [[1]]
# [[1]][[1]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    2    0    0
# [2,]    1    0    0    2    0    0

# [[1]][[2]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    2    0    1
# [2,]    0    0    0    2    0    0

【问题讨论】：

我想我可能知道您遇到速度问题的原因。首先，为什么你有嵌套列表？
第一层的每个列表都是一年，每一年都包含很多邻接矩阵。数据以这种方式组织，以便能够轻松访问特定年份和矩阵。现在我想了想，我可以取消列出外部列表并将年份合并到每个矩阵名称中。

标签： r list for-loop

【解决方案1】：

我会使用嵌套的lapply 循环遍历输入列表的嵌套结构，使用单个向量化操作而不是循环遍历列和行来重新计算mList 中的相关条目：

lapply(seq_along(dList), function(i) {
  lapply(seq_along(dList[[i]]), function(j) {
    t(t(mList[[i]][[j]]) * as.vector(dList[[i]][[j]] != 0))
  })
})
# [[1]]
# [[1]][[1]]
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    2    0    0
# [2,]    1    0    0    2    0    0
# 
# [[1]][[2]]
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    2    0    1
# [2,]    0    0    0    2    0    0

这是对相同结构列表的基准测试，mList 中有 10 x 10000 矩阵。我已经对您提供的解决方案、@thelatemail 的解决方案和我的解决方案进行了基准测试：

set.seed(144)
A <- matrix(sample(0:2, 100000, replace=TRUE), nrow=10)
B <- matrix(sample(0:2, 100000, replace=TRUE), nrow=10)
C <- matrix(sample(0:1, 10000, replace=TRUE), nrow=1)
D <- matrix(sample(0:1, 10000, replace=TRUE), nrow=1)
mList <-list(list(A, B))
dList <- list(list(C, D))

OP <- function(mList, dList) {
  for(i in 1:length(dList)) {
    for(j in 1:length(dList[[i]])) {
        for(k in 1:ncol(dList[[i]][[j]])) {
            for(m in 1:nrow(mList[[i]][[j]])) {
                mList[[i]][[j]][m, k] <- 
                    ifelse(
                        dList[[i]][[j]][k] == 1, 
                        mList[[i]][[j]][m, k], 
                        0
                    )
            }
        }
    }
  }
  mList
}
josilber <- function(mList, dList) {
  lapply(seq_along(dList), function(i) {
    lapply(seq_along(dList[[i]]), function(j) {
      t(t(mList[[i]][[j]]) * as.vector(dList[[i]][[j]] != 0))
    })
  })
}
thelatemail <- function(mList, dList) {
  Map(
    function(L,s) Map(function(sL,ss) {sL[,ss] <- 0; sL}, L, s),
    mList,
    lapply(dList, function(x) lapply(x, function(y) y==0) )
  )
}

library(microbenchmark)
microbenchmark(OP(mList, dList), josilber(mList, dList), thelatemail(mList, dList), times=10)
# Unit: milliseconds
#                       expr          min           lq         mean       median           uq          max neval
#           OP(mList, dList) 12252.468288 13318.745019 13478.116388 13486.732412 13840.106332 14259.053497    10
#     josilber(mList, dList)     2.299442     2.401806     2.561809     2.480822     2.552620     3.511609    10
#  thelatemail(mList, dList)     4.259594     4.438562     4.683855     4.612297     5.002605     5.122605    10

这两种解决方案的运行速度都快了 1000 倍以上，主要是因为它们没有在矩阵中紧密循环，而是以向量化的方式执行运算。

【讨论】：

不错的基准测试。以秒为单位表示时间确实带来了改进：OP:12.252s josilber:0.002s thelatemail:0.004s
非常感谢！这运行得更快。我前面有很多代码去循环。

【解决方案2】：

它并不漂亮，但您基本上总是需要使用嵌套列表结构进行双重循环。

Map(
  function(L,s) Map(function(sL,ss) {sL[,ss] <- 0; sL}, L, s),
  mList,
  lapply(dList, function(x) lapply(x, function(y) y==0) )
)
#[[1]]
#[[1]][[1]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    0    0    0    2    0    0
#[2,]    1    0    0    2    0    0
#
#[[1]][[2]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    0    0    0    2    0    1
#[2,]    0    0    0    2    0    0

我刚刚将所有 for() 循环替换为嵌套的 lapply 以生成要覆盖的列列表，然后使用嵌套的 Map 替换每个列表中的列。

【讨论】：