Calculate mean (or other function) per column for subsets of a matrix based on another matrix: answers

【Question title】: Calculate mean (or other function) per column for subsets of a matrix based on another matrix
【Posted】: 2015-02-19 02:58:00
【Question】:

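I am using a classifier in R that outputs a matrix of real values, with one column for each class I am classifying. I then apply a function to the output matrix and to my class-label matrix (one column per class) to compute an error per class (column).

This works for small data sets with an even distribution of class and non-class rows, but it breaks down when I use larger files with a skewed class versus non-class distribution. Typically my files contain less than 0.3% class rows versus 99.7% non-class rows, and in that case my classifier tends to simply output the non-class value (0).

I want to try a different error (cost) function to balance this out. I will also try up- and down-sampling, but those have other problems. One simple change I would like to try is to compute the error for class 1 and class 0 separately, then combine the two errors so that the class error is not swamped by the overwhelming non-class error.

I have included a minimal working example to demonstrate what I want.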

    L1 <- runif(13, min=0, max=1)
    L2 <- runif(13, min=0, max=1)
    predy <- cbind(L1, L2) # simulated output from the classifier
    #predy
    L1 <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
    L2 <- c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0)
    classy <- cbind(L1, L2) # Simulated class matrix
    #classy
    # Now compute error showing existing method
    mse <- apply((predy - classy)^2, 2, mean)
    nrmse <- sqrt(mse / apply(classy, 2, var))
    #
    #nrmse
    # L1       L2
    # 1.343796 1.062442
    #
    # Sort-of-code for what I would like to have
    # mse0 <- apply((predy - classy)^2, 2, mean) where x=0
    # mse1 <- apply((predy - classy)^2, 2, mean) where x=1
    # mse <- (mse0 + mse1) / 2   # or some similar way of combining them of my choice
    # nrmse <- sqrt(mse / apply(classy, 2, var))

In addition, my files are large and my classifier model is large too, so doing this in a computationally efficient way would be very helpful.

I managed to do this with a for loop (below). Can someone help translate it into an apply-style call?

    mean.ones  <- matrix(0, dim(classy)[2])
    mean.zeros <- matrix(0, dim(classy)[2])
    for (ix in 1:dim(classy)[2]) {
        ix.ones <- classy[, ix]==1
        mean.ones[ix]  <- mean(predy[ix.ones, ix])
        mean.zeros[ix] <- mean(predy[!ix.ones, ix])
    }

The code above is not the same as the original code: it only computes the conditional means. The code flow seems right, though.

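【Comments】:

To add an extra complication, my method must work for matrices with an arbitrary number of columns. My example shows two columns, but the solution should work for one, two, three, or any number of columns. That is why I am not using extraction.
When you say "class and non-class rows", do you mean "case (1) and non-case (0) rows"?
Managed to do this with a for loop. Can someone help translate it?
Jthorpe - thanks, yes (I think). I guess we use different terminology. Does my example with the for loop help?
Yes, I have added code below that reproduces mean.zeros and mean.ones. If mse stands for "mean squared error", then I think your code after the for loop is not correct.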

Tags: r for-loop apply lapply tapply

【Solution 1】:

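Here is a solution that takes advantage of (1) lexical scoping, so you do not have to pass the matrices to the summary function given to the first lapply(), and (2) the fact that predy and classy have the same dimensions.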
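To make those two points concrete, here is a minimal sketch on toy data. The names `pred`, `cls`, and `per_class` are made up for illustration and are not part of the answer. It shows what `tapply()` returns for a single column, and that the anonymous function picks up `pred` and `cls` from the enclosing environment without receiving them as arguments.

```r
# Toy data (hypothetical values, chosen so the means are easy to check by hand)
pred <- cbind(A = c(0.2, 0.4, 0.9, 0.7), B = c(0.1, 0.8, 0.3, 0.6))
cls  <- cbind(A = c(0,   0,   1,   1),   B = c(0,   1,   0,   1))

# The anonymous function only receives the column index i;
# pred and cls are found through lexical scoping.
per_class <- lapply(seq_len(ncol(pred)),
                    function(i) tapply(pred[, i], cls[, i], mean))

# Each list element is a vector named by class label ("0", "1").
# For column A: class 0 is mean(0.2, 0.4), class 1 is mean(0.9, 0.7).
per_class[[1]]
names(per_class[[1]])
```

Because the groups are ordered by their sorted labels, the class-0 value comes first whenever both classes occur in a column.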

Here is the calculation of the conditional means:

    # calculation of means
    temp <- lapply(seq.int(ncol(predy)),
                   function(i) tapply(predy[, i],
                                      classy[, i],
                                      mean))
    # presumably each column has members of both classes,
    # but if not, we'll ensure that there are two members
    # in each element of the list 'temp', as follows:
    temp <- lapply(temp,
                   function(x) x[match(0:1, names(x))])

    # bind the outputs together by column.
    mean_mx <- do.call(cbind, temp)
    all(mean_mx[1, ] == mean.zeros)
    all(mean_mx[2, ] == mean.ones)
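The `match(0:1, names(x))` step is easy to overlook, so here is a small sketch of what it does when a column contains only one class. The values and the names `pred_col`, `class_col`, and `padded` are hypothetical.

```r
# A column in which only class 0 occurs (hypothetical values)
pred_col  <- c(0.1, 0.3, 0.2)
class_col <- c(0, 0, 0)

x <- tapply(pred_col, class_col, mean)
length(x)       # only one group, so there is no entry for class "1"

# match() returns NA for the label that is absent, and indexing
# with NA yields an NA placeholder in that position.
padded <- x[match(0:1, names(x))]
length(padded)  # always two entries: class 0 first, class 1 second
is.na(padded)   # the missing class shows up as NA
```

Every list element then has length two, so `do.call(cbind, temp)` yields a matrix with two rows. The NA does carry through to anything computed from it, for example an average of the two class errors for that column.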

Here is the calculation of the mean squared error:

    # calculation of MSE
    temp <- lapply(seq.int(ncol(predy)),
                   function(i) tapply((predy[, i] - classy[, i])^2,
                                      classy[, i],
                                      mean))
    # presumably each column has members of both classes,
    # but if not, we'll ensure that there are two members
    # in each element of the list 'temp', as follows:
    temp <- lapply(temp,
                   function(x) x[match(0:1, names(x))])

    # bind the outputs together by column.
    mse_mx <- do.call(cbind, temp)

    mse0 <- mse_mx[1, ]  # per-column MSE over the class-0 rows
    mse1 <- mse_mx[2, ]  # per-column MSE over the class-1 rows
    mse <- (mse0 + mse1) / 2

    nrmse <- sqrt(mse / apply(classy, 2, var))
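Since the question asks for computational efficiency, one alternative worth sketching replaces the per-column `tapply()` calls with logical masks and `colSums()`. This is not the method from the answer above and has not been benchmarked here. It assumes that `classy` contains only 0 and 1, as the question states, and that every column contains both classes (otherwise a division by zero gives NaN). The names `sq_err`, `is_one`, `is_zero`, `mse0_vec`, and `mse1_vec` are illustrative, and the fixed seed is added only so the block runs reproducibly on its own.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility of this sketch
predy  <- cbind(L1 = runif(13), L2 = runif(13))
classy <- cbind(L1 = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
                L2 = c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0))

sq_err  <- (predy - classy)^2  # element-wise squared error
is_one  <- classy == 1         # logical mask of class-1 entries
is_zero <- !is_one             # logical mask of class-0 entries

# Per-column mean squared error within each class, without an explicit loop:
# the mask zeroes out the other class, and the mask's column sum is the count.
mse1_vec <- colSums(sq_err * is_one)  / colSums(is_one)
mse0_vec <- colSums(sq_err * is_zero) / colSums(is_zero)

mse   <- (mse0_vec + mse1_vec) / 2
nrmse <- sqrt(mse / apply(classy, 2, var))

# Cross-check against tapply() for the first column
# (groups come back in sorted order: class 0, then class 1).
ref <- as.vector(tapply(sq_err[, 1], classy[, 1], mean))
stopifnot(isTRUE(all.equal(unname(mse0_vec[1]), ref[1])),
          isTRUE(all.equal(unname(mse1_vec[1]), ref[2])))
```

This works on whole matrices at once, so it handles any number of columns. Whether it is actually faster than the `lapply()`/`tapply()` version on large inputs would need to be measured.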

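【Comments】:

Thanks Jthorpe. I will try this and get back to you.
No, sorry for confusing you, Jthorpe. Each class column contains only zeros or ones as values.
Excellent, Jthorpe, nearly there. I made a small edit to your code, because your two alternative ways of computing nrmse gave different answers on my PC, and your first suggestion (nrmse_correct) gave the same answer as my original code. Do the denominators really have the same dimensions? Thank you very much! I need to sit down and try to understand it fully.
In an earlier version of my code (I think before some edits to the OP?) the dimensions were different, but that piece of code is no longer relevant, so I deleted it.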