Weighted mean on multiple columns for all rows: answers

【Question title】: Weighted mean on multiple columns for all rows
【Posted】: 2019-01-28 04:22:00
【Question】:

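I want to compute a weighted mean for every row of a very large dataset.

The data contain NAs, so I need to incorporate something like na.rm = TRUE. For each row, I want to compute the following (over distance1 through distance10):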

(distance1 * X1CityNumber + ... + distance10 * X10CityNumber) /
(X1CityNumber + ... + X10CityNumber)

I wrote the following code, but it produces wrong numbers.

for (i in 1:378742) {
  rcffull$distance[i] <- weighted.mean(cbind(rcffull$distance1[i],
                                             rcffull$distance2[i],
                                             rcffull$distance3[i],
                                             rcffull$distance4[i],
                                             rcffull$distance5[i],
                                             rcffull$distance6[i],
                                             rcffull$distance7[i],
                                             rcffull$distance8[i],
                                             rcffull$distance9[i],
                                             rcffull$distance10[i]),
                                       cbind(rcffull$X1CityNumber[i],
                                             rcffull$X2CityNumber[i],
                                             rcffull$X3CityNumber[i],
                                             rcffull$X4CityNumber[i],
                                             rcffull$X5CityNumber[i],
                                             rcffull$X6CityNumber[i],
                                             rcffull$X7CityNumber[i],
                                             rcffull$X8CityNumber[i],
                                             rcffull$X9CityNumber[i],
                                             rcffull$X10CityNumber[i]),
                                       na.rm = TRUE)
  }

Any suggestions?

Sample data with fewer columns:

  distance1 Weights1 distance2 Weights2
1         5        3         8        2
2        NA        2         3        3
3         5       NA         4        4

#desired output:
    Mean distance
1      6.2 #= (5 * 3 + 8 * 2) / (3 + 2)
2      3.0 #= (3 * 3) / 3
3      4.0 #= (4 * 4) / 4

【Comments on the question】:

Tags: r dataframe matrix mean weighted

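【Solution 1】:

NAs occur in both the weights and the distances. When computing (d1 * w1 + d2 * w2) / (w1 + w2), any pair with an NA in either the distance or the weight must be dropped from both the numerator and the denominator, so the normalization of the weights has to account for this.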

dat <- structure(list(distance1 = c(5L, NA, 5L), Weights1 = c(3L, 2L, NA),
distance2 = c(8L, 3L, 4L), Weights2 = c(2L, 3L, 4L)), .Names = c("distance1", 
"Weights1", "distance2", "Weights2"), class = "data.frame", row.names = c("1", 
"2", "3"))

A <- as.matrix(dat[c(1, 3)])  ## distance columns
B <- as.matrix(dat[c(2, 4)])  ## weight columns
B[is.na(A)] <- 0
rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)
#  1   2   3 
#6.2 3.0 4.0
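
To apply the same idea to the ten column pairs from the question, the columns can be selected by name rather than by position. The following is a minimal sketch, not part of the original answer: the helper `row_weighted_mean` and the toy data frame `rcf_toy` are hypothetical, and the column naming scheme (`distance1`, `X1CityNumber`, ...) is assumed from the question's loop.

```r
# Hypothetical helper: row-wise weighted mean that skips any pair
# where the value or the weight is NA.
row_weighted_mean <- function(df, value_cols, weight_cols) {
  stopifnot(length(value_cols) == length(weight_cols))
  A <- as.matrix(df[value_cols])   # values (distances)
  B <- as.matrix(df[weight_cols])  # weights
  B[is.na(A)] <- 0                 # a missing value contributes no weight
  rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)
}

# Toy stand-in for rcffull, with two column pairs instead of ten
rcf_toy <- data.frame(
  distance1    = c(5, NA, 5),
  X1CityNumber = c(3, 2, NA),
  distance2    = c(8, 3, 4),
  X2CityNumber = c(2, 3, 4)
)

k <- 2  # would be 10 for the data described in the question
rcf_toy$distance <- row_weighted_mean(
  rcf_toy,
  paste0("distance", 1:k),
  paste0("X", 1:k, "CityNumber")
)
rcf_toy$distance
```

Building the column names with `paste0` keeps each distance column aligned with its weight column; selecting by position would silently break if the column order changed.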

Remark 1:

If there are no NAs in either the data or the weights, simply do

rowSums(A * B) / rowSums(B)

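Remark 2:

Another way to handle the NAs: set every entry that is NA in either the data or the weights to 0 in both matrices, then use rowSums without na.rm: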

ind <- is.na(A) | is.na(B)
A[ind] <- 0
B[ind] <- 0
rowSums(A * B) / rowSums(B)

Remark 3:

NaN can occur, from 0 / 0, if a row has no pair of a non-NA value and a non-NA weight.
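
A small sketch of that case, using a made-up data frame `dat_nan` (not from the original post) whose second row has no complete value/weight pair. Converting NaN to NA afterwards is one possible convention, not something the answer prescribes.

```r
# Row 2 has no pair in which both the distance and the weight are present
dat_nan <- data.frame(distance1 = c(5, NA), Weights1 = c(3, 2),
                      distance2 = c(8, 3),  Weights2 = c(2, NA))

A <- as.matrix(dat_nan[c(1, 3)])  # distance columns
B <- as.matrix(dat_nan[c(2, 4)])  # weight columns
B[is.na(A)] <- 0

res_raw <- rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)
res_raw  # second element is NaN, because both sums are 0

# Optional: report "no usable data" as NA instead of NaN
res <- res_raw
res[is.nan(res)] <- NA
res
```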

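Remark 4:

weighted.mean can only drop NAs in the data, not in the weights. It is also a poor fit here because you want the calculation for every row: weighted.mean has no "vectorized" form across rows, so you would have to run a slow R-level loop.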
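
For illustration, here is a sketch of what such a loop would look like on the sample data. It shows that `na.rm = TRUE` does not help when a weight is NA, and that filtering incomplete pairs by hand gives the same numbers as the rowSums approach. The names `x`, `w`, and `loop_result` are illustrative only.

```r
dat <- data.frame(distance1 = c(5, NA, 5), Weights1 = c(3, 2, NA),
                  distance2 = c(8, 3, 4),  Weights2 = c(2, 3, 4))
x <- as.matrix(dat[c(1, 3)])  # distance columns
w <- as.matrix(dat[c(2, 4)])  # weight columns

# na.rm = TRUE only drops missing values in x, so an NA weight still yields NA
na_case <- weighted.mean(x[3, ], w[3, ], na.rm = TRUE)
na_case

# Row-by-row loop: keep only pairs where both value and weight are present
loop_result <- vapply(seq_len(nrow(dat)), function(i) {
  keep <- !is.na(x[i, ]) & !is.na(w[i, ])
  weighted.mean(x[i, keep], w[i, keep])
}, numeric(1))
loop_result
```

On three rows the loop is harmless, but it calls an R function once per row, which is why the matrix-based rowSums version is preferable for a dataset with hundreds of thousands of rows.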

【Discussion】: