Several column sums by grouping variables in R: answers

【Question title】: Several column sums by grouping variables in R
【Posted】: 2021-06-03 21:35:24
【Question description】:

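I have a data frame containing term frequencies and some other miscellaneous demographic variables. I want to take two grouping variables, drop the variables I don't need, and then sum the frequencies by the grouping variables.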

Here is something similar to mine:

df <- data.frame(user= c(1:9),
                 Group1 = c("a", "a", "a", "b", "b","b","c", "c", "c"),
                 Group2 = c("d", "e", "d", "e", "d", "e", "e", "e", "e"),
                 term1 = c(0, 1, 1, 0, 1, 1, 0, 0, 0),
                 term2 = c(1, 0, 1, 1, 0, 1, 0, 1, 1),
                 term3 = c(0, 1, 0, 0, 0, 0, 1, 1, 0))

This is what I would like to get:

desired <- data.frame(Group1 = c("a", "a", "b", "b", "c", "c"),
                      Group2 = c("d", "e", "d", "e", "d", "e"),
                      term1 = c(1, 1, 1, 1, 0, 0),
                      term2 = c(2, 0, 0, 2, 0, 2),
                      term3 = c(0, 1, 0, 0, 0, 2))

My real data frame has about 4000 term columns, so naming each one individually in a dplyr function does not seem feasible.

Thanks!

【Question discussion】:

Related: Aggregate / summarize multiple variables per group (e.g. sum, mean)

Tags: r

【Solution 1】:

You can try aggregate + expand.grid + merge:

merge(
  with(df, expand.grid(Group1 = unique(Group1), Group2 = unique(Group2))),
  aggregate(. ~ Group1 + Group2, df[-1], sum),
  all = TRUE
)

which gives

  Group1 Group2 term1 term2 term3
1      a      d     1     2     0
2      a      e     1     0     1
3      b      d     1     0     0
4      b      e     1     2     0
5      c      d    NA    NA    NA
6      c      e     0     2     2

If you want the NAs to become 0, you can try

> res <- merge(
  with(df, expand.grid(Group1 = unique(Group1), Group2 = unique(Group2))),
  aggregate(. ~ Group1 + Group2, df[-1], sum),
  all = TRUE
)

> replace(res, is.na(res), 0)
  Group1 Group2 term1 term2 term3
1      a      d     1     2     0
2      a      e     1     0     1
3      b      d     1     0     0
4      b      e     1     2     0
5      c      d     0     0     0
6      c      e     0     2     2

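【Discussion】:

【Solution 2】:

We can group by 'Group1' and 'Group2', take the sum of the 'term' columns inside summarise, and then expand the data with complete to fill in the missing combinations.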

library(dplyr)
library(tidyr)
df %>%
     group_by(Group1, Group2) %>% 
     summarise(across(starts_with('term'), sum), .groups = 'drop') %>%
     complete(Group1, Group2, fill = list(term1 = 0, term2 = 0, term3 = 0))

-output

# A tibble: 6 x 5
  Group1 Group2 term1 term2 term3
  <chr>  <chr>  <dbl> <dbl> <dbl>
1 a      d          1     2     0
2 a      e          1     0     1
3 b      d          1     0     0
4 b      e          1     2     0
5 c      d          0     0     0
6 c      e          0     2     2
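
With thousands of term columns, writing out the `fill` list by hand is not practical. Below is a minimal sketch of building it programmatically, assuming every term column name starts with `term`. The names `term_cols` and `fill_zero` are illustrative and not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(user = 1:9,
                 Group1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
                 Group2 = c("d", "e", "d", "e", "d", "e", "e", "e", "e"),
                 term1 = c(0, 1, 1, 0, 1, 1, 0, 0, 0),
                 term2 = c(1, 0, 1, 1, 0, 1, 0, 1, 1),
                 term3 = c(0, 1, 0, 0, 0, 0, 1, 1, 0))

# Assumption: all term columns share the prefix "term"
term_cols <- grep("^term", names(df), value = TRUE)

# Named list of zeros, one entry per term column, for complete(fill = ...)
fill_zero <- setNames(as.list(rep(0, length(term_cols))), term_cols)

res <- df %>%
  group_by(Group1, Group2) %>%
  summarise(across(all_of(term_cols), sum), .groups = "drop") %>%
  complete(Group1, Group2, fill = fill_zero)

res
```

Because `fill_zero` is derived from the column names, the same pipeline should carry over unchanged to a data frame with many more term columns.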

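【Discussion】:

【Solution 3】:

If you do not need to complete all the group combinations, setDT(df)[,lapply(.SD[,-1], sum),.(Group1,Group2)] is enough. Otherwise, you can use complete from the tidyr package (as used in the dplyr answer above) to fill in the missing combinations.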

library(data.table)
library(tidyr)

setDT(df)[,lapply(.SD[,-1], sum),.(Group1,Group2)] %>%
    complete(Group1, Group2, fill = list(term1 = 0, term2 = 0, term3 = 0))
#> # A tibble: 6 x 5
#>   Group1 Group2 term1 term2 term3
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 a      d          1     2     0
#> 2 a      e          1     0     1
#> 3 b      d          1     0     0
#> 4 b      e          1     2     0
#> 5 c      d          0     0     0
#> 6 c      e          0     2     2
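
If you would rather stay within data.table, the same result can probably be reached without tidyr. The sketch below is an alternative, not the answerer's method: it selects the term columns by name pattern through `.SDcols`, builds every group combination with `CJ()`, joins the sums onto that grid, and zero-fills with `setnafill()`. It assumes the term columns are numeric and start with `term`; `dt`, `term_cols`, `sums`, `grid` and `res` are illustrative names.

```r
library(data.table)

# Example data from the question
df <- data.frame(user = 1:9,
                 Group1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
                 Group2 = c("d", "e", "d", "e", "d", "e", "e", "e", "e"),
                 term1 = c(0, 1, 1, 0, 1, 1, 0, 0, 0),
                 term2 = c(1, 0, 1, 1, 0, 1, 0, 1, 1),
                 term3 = c(0, 1, 0, 0, 0, 0, 1, 1, 0))

# Work on a copy so the original data frame is left untouched
dt <- as.data.table(df)

# Assumption: all term columns share the prefix "term"
term_cols <- grep("^term", names(dt), value = TRUE)

# Sum only the term columns within each Group1/Group2 pair
sums <- dt[, lapply(.SD, sum), by = .(Group1, Group2), .SDcols = term_cols]

# Every Group1 x Group2 combination, including ones absent from the data
grid <- CJ(Group1 = unique(dt$Group1), Group2 = unique(dt$Group2))

# Join the sums onto the full grid; absent combinations come back as NA
res <- sums[grid, on = .(Group1, Group2)]

# Replace those NAs with 0 by reference (needs a data.table version
# that provides setnafill)
setnafill(res, fill = 0, cols = term_cols)

res
```

Selecting columns through `.SDcols` also avoids relying on the position of the `user` column, which `.SD[,-1]` does.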

【Discussion】: