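Weighted means for several columns, by groups (in a data.table): answers

【Question title】: Weighted means for several columns, by groups (in a data.table)
【Posted】: 2014-09-24 14:22:27
【Question description】:

This question is related to another question on group weighted means: I want to compute weighted within-group means using data.table. The difference from the original question is that the names of the variables to average are specified in a character vector.

Data: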

df <- read.table(text= "
          region    state  county  weights y1980  y1990  y2000
             1        1       1       10     100    200     50
             1        1       2        5      50    100    200
             1        1       3      120    1000    500    250
             1        1       4        2      25    100    400
             1        1       4       15     125    150    200
             2        2       1        1      10     50    150
             2        2       2       10      10     10    200
             2        2       2       40      40    100     30
             2        2       3       20     100    100     10
", header=TRUE, na.strings=NA)

Using the answer Roland suggested for that question:

library(data.table)
dt <- as.data.table(df)
dt2 <- dt[,lapply(.SD,weighted.mean,w=weights),by=list(region,state,county)]

I have a character vector that dynamically determines the columns for which I want within-group weighted means.

colsToKeep = c("y1980","y1990")

But I don't know how to pass it as an argument to the data.table magic.

I tried

 dt[,lapply(
      as.list(colsToKeep),weighted.mean,w=weights),
      by=list(region,state,county)]

But I get:

Error in x * w : non-numeric argument to binary operator

I don't know how to achieve what I want.
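For context, the error can be reproduced outside data.table. The sketch below is only an illustration in base R, and `err_msg` is a name made up for it: `as.list(colsToKeep)` holds the column names as strings, so `weighted.mean()` is handed a string instead of the column values.

```r
# as.list(colsToKeep) yields the column *names*, not the column values.
colsToKeep <- c("y1980", "y1990")
str(as.list(colsToKeep))

# weighted.mean() then receives a string as x, and multiplying it by a
# numeric weight fails. tryCatch() captures the message instead of stopping.
err_msg <- tryCatch(
  weighted.mean("y1980", w = 10),
  error = function(e) conditionMessage(e)
)
print(err_msg)
```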

Bonus question: I would like to keep the original column names instead of getting V1 and V2.

Note that I am using version 1.9.3 of the data.table package.

【Question discussion】:

Tags: r data.table

【Solution 1】:

Normally, you should be able to do:

dt2 <- dt[,lapply(.SD,weighted.mean,w=weights), 
          by = list(region,state,county), .SDcols = colsToKeep]

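That is, just supply those columns to .SDcols. But at the moment this does not work due to a bug: the weights column is unavailable because it is not listed in .SDcols.

Until that is fixed, we can do it as follows: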

dt2 <- dt[, lapply(mget(colsToKeep), weighted.mean, w = weights), 
            by = list(region, state, county)]
#    region state county     y1980    y1990
# 1:      1     1      1  100.0000 200.0000
# 2:      1     1      2   50.0000 100.0000
# 3:      1     1      3 1000.0000 500.0000
# 4:      1     1      4  113.2353 144.1176
# 5:      2     2      1   10.0000  50.0000
# 6:      2     2      2   34.0000  82.0000
# 7:      2     2      3  100.0000 100.0000
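
To see why the `mget()` version keeps the column names, here is a minimal base R sketch that imitates a single group (county 4 of region 1) without data.table. The environment `grp` and the variables `cols` and `res` are hypothetical names for this illustration only. Inside data.table the lookup happens among the columns of the current group instead.

```r
# Hypothetical stand-in for one group (region 1, state 1, county 4):
# each column becomes a variable in its own environment.
grp <- list2env(list(
  weights = c(2, 15),
  y1980   = c(25, 125),
  y1990   = c(100, 150)
))
colsToKeep <- c("y1980", "y1990")

# mget() looks the names up and returns a *named* list of the values,
cols <- mget(colsToKeep, envir = grp)
# so lapply() sees numeric vectors and the result keeps the column names.
res <- lapply(cols, weighted.mean, w = grp$weights)
str(res)
```

By hand, the y1980 value is (2 * 25 + 15 * 125) / (2 + 15) = 1925 / 17, which is about 113.2353 and agrees with row 4 of the output above.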

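【Discussion】:

Is the bug still present, or is the first approach no longer recommended? On 16 December 2015 I got this message: Error in as.double(w) : cannot coerce type 'closure' to vector of type 'double'
The bug has not been fixed yet, sorry :-(. You can do this: dt[, lapply(mget(colsToKeep), weighted.mean, w=weights), by=.(region,state,county)]. Your error seems to indicate that as.double is being called with a function as its input (unrelated).
Thanks. So basically your suggestion is to use mget() instead of as.list(.SD)[], right? (I know the dot you use after by= is shorthand for list, so that code is the same as your workaround above.) (As for the error message, I think I just copy-pasted the OP's data without going through data.frame.)
Yes, that's right. I'll edit the answer to replace as.list(). I must have written it that way because mget() didn't work back then (that was another bug, but we managed to fix it).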
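
The point made in the comments, that `.()` in `by=` is shorthand for `list()`, can be checked with a small sketch. It assumes the data.table package is installed. The three-row table and the names `a` and `b` are made up for illustration.

```r
library(data.table)

# Small made-up table: two rows for county 4, one row for county 2.
dt <- data.table(
  county  = c(4, 4, 2),
  weights = c(2, 15, 5),
  y1980   = c(25, 125, 50)
)
colsToKeep <- "y1980"

# Same call twice, once with by = .(...) and once with by = list(...).
a <- dt[, lapply(mget(colsToKeep), weighted.mean, w = weights), by = .(county)]
b <- dt[, lapply(mget(colsToKeep), weighted.mean, w = weights), by = list(county)]

# TRUE if both spellings give the same grouped result.
print(isTRUE(all.equal(a, b)))
```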

【Solution 2】:

I don't know data.table, but have you considered using dplyr? I think it is almost as fast as data.table

library(dplyr)
df %>% 
  group_by(region, state, county) %>% 
  summarise(mean_80 = weighted.mean(y1980, weights), 
            mean_90 = weighted.mean(y1990, weights))
Source: local data frame [7 x 5]
Groups: region, state

  region state county   mean_80  mean_90
1      1     1      1  100.0000 200.0000
2      1     1      2   50.0000 100.0000
3      1     1      3 1000.0000 500.0000
4      1     1      4  113.2353 144.1176
5      2     2      1   10.0000  50.0000
6      2     2      2   34.0000  82.0000
7      2     2      3  100.0000 100.0000

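【Discussion】:

Thanks for your help, but I need to use data.table, and your answer also doesn't address the new constraint in my question, namely that the columns must be specified dynamically by a vector.
My bad, I should have read your post more carefully. If you decide to switch to dplyr, here's an example that might be useful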
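
For readers who do go the dplyr route, the dynamic-column constraint raised in the comments could be met along the following lines. This is only a sketch and not the example the commenter linked to. It assumes dplyr 1.0 or later, where `across()` and `all_of()` are available, and it uses a small made-up data frame.

```r
library(dplyr)

# Small made-up data frame: two rows for county 4, one row for county 2.
df <- data.frame(
  county  = c(4, 4, 2),
  weights = c(2, 15, 5),
  y1980   = c(25, 125, 50),
  y1990   = c(100, 150, 100)
)
colsToKeep <- c("y1980", "y1990")

# all_of() selects the columns named in the character vector, and
# across() applies the weighted mean to each one within every group.
res <- df %>%
  group_by(county) %>%
  summarise(across(all_of(colsToKeep), ~ weighted.mean(.x, w = weights)))
print(res)
```

The result columns keep their original names (y1980, y1990), which also covers the bonus question about avoiding V1 and V2.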