Comparing an m-dimensional array with an (m-1)-dimensional array repeated along an arbitrary dimension

【Question title】: Making comparisons of a m-dimensional array and an (m-1)-dimensional array repeated along an arbitrary dimension
【Posted】: 2015-02-09 18:52:56
【Question description】:

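I have implemented a computation based on multidimensional arrays that replaces some loop code. Along the way I did a few things that I think could be done better, but I'm not sure how.

One of them is comparing the resulting 3d array with a 2d array repeated along the third dimension.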

items12 = c(1,2,3,4,5,6)
items3 = c(1,2,3)

m2d = outer(items12, items12, "-")
m3d = outer(items3, m2d, "*")

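After some manipulation I want to compare m2d and m3d, with m2d repeated along the third dimension. I know of two options, neither of which seems elegant, and I'm curious whether there is a better way.

Instantiate the repeated 3d array. Memory-heavy but fast.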

m2d.z.3d = outer(
  m2d, 
  rep(1, length(items3)), "*"
)

m3d - m2d.z.3d

Loop. Light on memory but slow.

apply(m3d, 3, function(x) {
    x - m2d
})

Any suggestions? Which one would you choose?

Update: an example to clarify the arbitrary-index requirement.

items.3 = c(1,2,3)
items.2 = c(1,2)

m2d = outer(items.3, items.3, "-")
m3d = outer(m2d, items.2, "*")

m3d - (m3d - items.3)

# items.3 wrapped around columns
, , 1

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3

, , 2

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3

m3d.yx = aperm(m3d, c(2,1,3))
aperm(m3d.yx - (m3d.yx - c(items.3)), c(2,1,3))

# items.3 wrapped along rows
, , 1

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    1    2    3

, , 2

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    1    2    3

Update

Some benchmarks for this case.

items.3 = rep(c(1,2,3), n)
items.2 = rep(c(1,2), n)

m2d = outer(items.3, items.3, "-")
m3d = outer(m2d, items.2, "*")

funRecycle = function() # items.3 wraps around the columns (index 1, then 2, then 3 etc.)
  m3d - (m3d - c(items.3)) 
funAperm = function() { # temporarily interchange index 1 and 2 to apply along desired index
  m3d.yx = aperm(m3d, c(2,1,3))
  aperm(m3d.yx - (m3d.yx - c(items.3)), c(2,1,3)) 
}
funOuter = function() { # assign the 3d matrix
  m2d.z.3d = outer(
    m2d, 
    rep(1, length(items.2)), "*"
  )
  m3d - m2d.z.3d
}
funArray = function() { # assign the 3d matrix with array
  m2d.z.3d = array(m2d, dim=c(dim(m2d)[1:2], length(items.2)))
  m3d - m2d.z.3d
}
funSweep <- function() sweep(m3d, c(1, 2), m2d, "-")

n = 1

Unit: microseconds
         expr    min      lq     mean  median      uq    max neval   cld
 funRecycle()  1.110  1.3875  1.65388  1.6650  1.9420  2.775   100 a    
   funAperm() 17.200 19.1420 21.23113 20.2520 20.9455 69.077   100    d 
   funOuter() 14.426 15.8130 17.58316 17.2005 18.1710 35.232   100   c  
   funArray()  2.774  3.3300  3.95079  3.8840  4.1610 14.148   100  b   
   funSweep() 31.903 32.7360 34.84129 33.5680 34.4000 62.141   100     e

n=100

Unit: milliseconds
         expr       min        lq      mean    median        uq       max
 funRecycle()  28.51351  32.35671  37.13257  33.98931  39.94408  85.94085
   funAperm() 232.69297 276.07494 344.70083 352.40273 395.50492 569.54978
   funOuter()  35.25947  43.98674  53.06895  49.72790  55.93677  95.38608
   funArray()  96.78482 110.10501 119.68267 116.50378 120.70943 172.53973
   funSweep() 150.88675 168.90293 193.06270 178.11013 216.79349 291.23719

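I was surprised by the results: somehow, at large n, multiplying everything by 1 with outer() becomes faster than simply replicating the array with array(). (At large n, outer() looks like it might even become faster than the recycling approach.)

If we have to make the comparison along a different index (funAperm), building the array with outer() is faster in both cases, and much faster at n = 100.

Apart from aperm, are there any suggestions for comparing across an arbitrary index?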

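【Comments】:

Sounds like an optimization problem, and it really depends on which constraint is tighter (memory or time). The context should drive the choice. Some good tips for optimizing performance can be found here; perhaps the best advice is to optimize your code only when performance is actually a problem.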
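By the way, have you used system.time() or microbenchmark() to verify that the loop really is slower than creating a new array and then doing the comparison?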
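Thanks for the replies. I have run some benchmarks. I'll run a few more that include Brodie's suggestions and post the results later.
Also, I'm not sure your apply version is functionally equivalent to your outer version.

Tags: arrays r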

【Solution 1】:

Assuming you meant the following (I assume so because otherwise m3d - m2d.z.3d doesn't work):

m3d = outer(m2d, items3, "*") # note how I switched the arguments

then this works:

m3d - c(m2d)

Proof:

all.equal(m3d - c(m2d), m3d - m2d.z.3d)
# [1] TRUE

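Here we are just taking advantage of vector recycling, since we want to repeat along the last dimension. We need to use c() to drop the dimensions, otherwise R complains that the arrays are non-conformable (even though they are conformable in the particular sense we want).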
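The recycling behaviour described above can be checked on a small case. This is only a sketch: the array sizes are chosen for illustration and are not the ones benchmarked in this thread.

```r
# Small arrays shaped like the ones in the answer (sizes are illustrative).
items12 <- c(1, 2, 3)
items3  <- c(1, 2)
m2d <- outer(items12, items12, "-")   # 3 x 3
m3d <- outer(m2d, items3, "*")        # 3 x 3 x 2

# With its dim attribute intact, m2d is rejected by the arithmetic operator;
# the error message is captured instead of stopping the script.
res <- tryCatch(m3d - m2d, error = function(e) conditionMessage(e))
print(res)

# c() strips the dim attribute, leaving a length-9 vector that is recycled
# in column-major order, i.e. once per slice along the third dimension.
diff_recycled <- m3d - c(m2d)

# Each slice along dimension 3 has had m2d subtracted from it.
stopifnot(identical(diff_recycled[, , 1], m3d[, , 1] - m2d))
stopifnot(identical(diff_recycled[, , 2], m3d[, , 2] - m2d))
```

This only lines up because R stores arrays in column-major order, so the first prod(dim(m2d)) elements of m3d are exactly its first slice.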

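Based on a cursory review of the R source (src/main/arithmetic.c:real_binary()), it looks like vector recycling does not copy the recycled vector, so this should be both fast and memory efficient.

If we want to do this along an arbitrary dimension, we have to use aperm to reorder the array's dimensions so that the relevant dimension comes last, and then reorder the result back to the original dimension order. This adds some overhead.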
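A minimal sketch of that permute, recycle, permute-back pattern follows. The helper name subtract_along and its arguments are hypothetical, introduced only for illustration; it assumes sub is an array whose dimensions equal those of x with dimension d removed.

```r
# Hypothetical helper (not from the original post): subtract `sub`, an array
# with one dimension fewer than `x`, from every slice of `x` along dimension `d`.
subtract_along <- function(x, sub, d) {
  m <- length(dim(x))
  stopifnot(identical(dim(x)[-d], dim(sub)))
  perm <- c(seq_len(m)[-d], d)   # move dimension d to the end
  xp <- aperm(x, perm)
  out <- xp - c(sub)             # recycling repeats sub along the last dimension
  aperm(out, order(perm))        # order(perm) is the inverse permutation
}

# Usage: repeat a 2 x 4 matrix along dimension 2 of a 2 x 3 x 4 array.
x <- array(seq_len(24), dim = c(2, 3, 4))
s <- matrix(1:8, nrow = 2)
res <- subtract_along(x, s, 2)
stopifnot(identical(dim(res), dim(x)))
stopifnot(all(res[, 2, ] == x[, 2, ] - s))
```

The two aperm() calls each copy the whole array, which is where the overhead mentioned above comes from.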

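As for which approach to choose: if you aren't running out of memory, use the fast approach (i.e. avoid loops in favor of fully vectorized operations).

Also, some benchmarks with items12 <- seq(100) and items3 <- seq(50):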

funOuter <- function() {
  m2d.z.3d = outer(
    m2d, 
    rep(1, length(items3)), "*"
  )
  m3d - m2d.z.3d
}
funRecycle <- function() m3d - c(m2d)
funLoop <- function() apply(m3d, 3, "-", m2d)    # this does not appear correct because `apply` doesn't reconstruct dimensions like `sapply`
funSweep <- function() sweep(m3d, c(1, 2), m2d)  # this is the same type of thing but works properly

library(microbenchmark)
microbenchmark(funOuter(), funRecycle(), funLoop(), funSweep())

Produces:

Unit: milliseconds
         expr       min        lq      mean    median
   funOuter()  2.297287  2.673768  3.232277  2.835404
 funRecycle()  1.327101  1.485082  2.093252  1.599543
    funLoop() 22.579010 24.586667 27.211804 26.840069
   funSweep() 11.251656 12.012664 13.516147 13.736908

And checking the results:

all.equal(funOuter(), funRecycle())
# [1] TRUE
all.equal(funOuter(), funSweep())
# [1] TRUE
all.equal(funOuter(), funLoop())
# Nope, not equal
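The mismatch for funLoop() comes from the shape of what apply() returns, not from the arithmetic. The sketch below, on small arrays chosen for illustration, shows the flattened result and one possible way to restore the dimensions; the reshaping step is an addition here, not part of the original answer.

```r
items12 <- c(1, 2, 3)
items3  <- c(1, 2)
m2d <- outer(items12, items12, "-")   # 3 x 3
m3d <- outer(m2d, items3, "*")        # 3 x 3 x 2

# apply() flattens each 3 x 3 result into one column of a matrix.
looped <- apply(m3d, 3, function(x) x - m2d)
dim(looped)    # 9 2

# Each column holds one slice in column-major order, so giving the values
# the original dimensions back recovers the 3d result.
looped3d <- array(looped, dim = dim(m3d))
stopifnot(isTRUE(all.equal(looped3d, m3d - c(m2d))))
```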

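【Comments】:

Hi Brodie, thanks for the reply. This is a great tip that I hadn't even considered. You are also right about the switched arguments, sorry about that. I'll run the benchmarks and post the results.
Also, aperm() is a function I had been looking for, and I was about to post another question to find it! My bad approach was to create two arrays by passing the arguments to outer() in different orders. Thanks!
Thanks a lot, great tips. I learned a lot. I have accepted your answer as correct, but I can't upvote it because of my reputation.
I added some benchmarks for the arbitrary-index case. Do you have any ideas or suggestions for doing this faster? aperm() is slow.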
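On the aperm() question in the last comment, one possible alternative for the case where a plain vector varies along a single index is to expand the vector with rep(..., each = ) so that ordinary recycling lines it up with that index. This is a sketch, not something proposed or benchmarked in the thread, and the helper name along_dim is hypothetical.

```r
items.3 <- c(1, 2, 3)
items.2 <- c(1, 2)
m2d <- outer(items.3, items.3, "-")   # 3 x 3
m3d <- outer(m2d, items.2, "*")       # 3 x 3 x 2

# Hypothetical helper: repeat each element of v once for every combination of
# the indices before dimension d, so recycling aligns v with dimension d of x.
along_dim <- function(x, v, d) {
  stopifnot(length(v) == dim(x)[d])
  rep(v, each = prod(dim(x)[seq_len(d - 1)]))
}

# items.3 varying along index 2, without permuting m3d.
via_rep <- m3d - (m3d - along_dim(m3d, items.3, 2))

# The aperm() version from the question, for comparison.
m3d.yx    <- aperm(m3d, c(2, 1, 3))
via_aperm <- aperm(m3d.yx - (m3d.yx - items.3), c(2, 1, 3))

stopifnot(isTRUE(all.equal(via_rep, via_aperm)))
```

The expanded vector has length prod(dim(x)[1:d]), so the memory saving shrinks as d approaches the last dimension; whether it beats aperm() would need to be measured.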