【Question title】: Several column sums by grouping variables in R
【Posted】: 2021-06-03 21:35:24
【Question】:

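I have a data frame containing word frequencies and a few other random demographic variables. I want to use two of the variables as grouping variables, drop the ones I don't need, and then sum the frequencies within each group.

Here is something similar to mine: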

df <- data.frame(user= c(1:9),
                 Group1 = c("a", "a", "a", "b", "b","b","c", "c", "c"),
                 Group2 = c("d", "e", "d", "e", "d", "e", "e", "e", "e"),
                 term1 = c(0, 1, 1, 0, 1, 1, 0, 0, 0),
                 term2 = c(1, 0, 1, 1, 0, 1, 0, 1, 1),
                 term3 = c(0, 1, 0, 0, 0, 0, 1, 1, 0))

This is what I want to get:

desired <- data.frame(Group1 = c("a", "a", "b", "b", "c", "c"),
                      Group2 = c("d", "e", "d", "e", "d", "e"),
                      term1 = c(1, 1, 1, 1, 0, 0),
                      term2 = c(2, 0, 0, 2, 0, 2),
                      term3 = c(0, 1, 0, 0, 0, 2))

My real data frame has about 4,000 term columns, so naming each one individually in a dplyr function doesn't seem feasible.

Thanks!

【Comments】:

Tags: r


【Solution 1】:

You can try aggregate + expand.grid + merge:

merge(
  with(df, expand.grid(Group1 = unique(Group1), Group2 = unique(Group2))),
  aggregate(. ~ Group1 + Group2, df[-1], sum),
  all = TRUE
)

which gives

  Group1 Group2 term1 term2 term3
1      a      d     1     2     0
2      a      e     1     0     1
3      b      d     1     0     0
4      b      e     1     2     0
5      c      d    NA    NA    NA
6      c      e     0     2     2

If you want the NAs to become 0, you can try

res <- merge(
  with(df, expand.grid(Group1 = unique(Group1), Group2 = unique(Group2))),
  aggregate(. ~ Group1 + Group2, df[-1], sum),
  all = TRUE
)

replace(res, is.na(res), 0)
  Group1 Group2 term1 term2 term3
1      a      d     1     2     0
2      a      e     1     0     1
3      b      d     1     0     0
4      b      e     1     2     0
5      c      d     0     0     0
6      c      e     0     2     2
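
With the real data there are other demographic columns besides `user`, so `df[-1]` with `. ~` would also try to sum those. A minimal base R sketch of the same aggregate + expand.grid + merge idea that picks the term columns by name is shown here. It assumes the term columns share a common prefix (`"term"` is taken from the example data and may not match the real column names); `term_cols`, `grid` and `sums` are names introduced only for illustration.

```r
# Example data from the question
df <- data.frame(user = 1:9,
                 Group1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
                 Group2 = c("d", "e", "d", "e", "d", "e", "e", "e", "e"),
                 term1 = c(0, 1, 1, 0, 1, 1, 0, 0, 0),
                 term2 = c(1, 0, 1, 1, 0, 1, 0, 1, 1),
                 term3 = c(0, 1, 0, 0, 0, 0, 1, 1, 0))

# Assumption: every column to be summed starts with "term"
term_cols <- grep("^term", names(df), value = TRUE)

# All Group1 x Group2 combinations, kept as character rather than factor
grid <- expand.grid(Group1 = unique(df$Group1),
                    Group2 = unique(df$Group2),
                    stringsAsFactors = FALSE)

# Sum only the term columns within each observed group
sums <- aggregate(df[term_cols], by = df[c("Group1", "Group2")], FUN = sum)

# Left join onto the full grid, then zero-fill only the term columns
res <- merge(grid, sums, all.x = TRUE)
res[term_cols][is.na(res[term_cols])] <- 0
res
```

Restricting the NA replacement to `term_cols` leaves the grouping columns untouched, which matters if they can legitimately contain NA.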

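【Comments】:

    【Solution 2】:

    We can group by 'Group1' and 'Group2', take the sum of the 'term' columns in summarise, and then expand the data with complete to fill in the missing combinations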

    library(dplyr)
    library(tidyr)
    df %>%
         group_by(Group1, Group2) %>% 
         summarise(across(starts_with('term'), sum), .groups = 'drop') %>%
         complete(Group1, Group2, fill = list(term1 = 0, term2 = 0, term3 = 0))
    

    -output

    # A tibble: 6 x 5
      Group1 Group2 term1 term2 term3
      <chr>  <chr>  <dbl> <dbl> <dbl>
    1 a      d          1     2     0
    2 a      e          1     0     1
    3 b      d          1     0     0
    4 b      e          1     2     0
    5 c      d          0     0     0
    6 c      e          0     2     2
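
The question mentions roughly 4,000 term columns, so writing out `fill = list(term1 = 0, ...)` by hand would not scale. One possible way around that, sketched here under the assumption that all term columns start with `"term"`, is to build the `fill` list from the column names. The names `term_cols` and `fill_values` are introduced only for this illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(user = 1:9,
                 Group1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
                 Group2 = c("d", "e", "d", "e", "d", "e", "e", "e", "e"),
                 term1 = c(0, 1, 1, 0, 1, 1, 0, 0, 0),
                 term2 = c(1, 0, 1, 1, 0, 1, 0, 1, 1),
                 term3 = c(0, 1, 0, 0, 0, 0, 1, 1, 0))

# Assumption: every column to be summed starts with "term"
term_cols <- grep("^term", names(df), value = TRUE)

# Named list of zeros, one entry per term column, for complete()'s fill argument
fill_values <- setNames(as.list(rep(0, length(term_cols))), term_cols)

res <- df %>%
  group_by(Group1, Group2) %>%
  summarise(across(all_of(term_cols), sum), .groups = "drop") %>%
  complete(Group1, Group2, fill = fill_values)
res
```

On the example data this should give the same six rows as the output above, without any term column being named explicitly.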
    

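    【Comments】:

      【Solution 3】:

      If you don't need every combination of the grouping variables to appear, setDT(df)[,lapply(.SD[,-1], sum),.(Group1,Group2)] is enough. Otherwise, you can use complete from the tidyr package (as in the dplyr answer) to fill in the missing combinations.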

      library(data.table)
      library(tidyr)
      
      setDT(df)[,lapply(.SD[,-1], sum),.(Group1,Group2)] %>%
          complete(Group1, Group2, fill = list(term1 = 0, term2 = 0, term3 = 0))
      #> # A tibble: 6 x 5
      #>   Group1 Group2 term1 term2 term3
      #>   <chr>  <chr>  <dbl> <dbl> <dbl>
      #> 1 a      d          1     2     0
      #> 2 a      e          1     0     1
      #> 3 b      d          1     0     0
      #> 4 b      e          1     2     0
      #> 5 c      d          0     0     0
      #> 6 c      e          0     2     2
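
If staying within data.table is preferred, the missing combinations can also be filled without tidyr. The following is only a sketch of one way to do it, again assuming the term columns start with `"term"`: `CJ()` builds the full grid, a join brings in the sums, and `setnafill()` replaces the resulting NAs. It uses `as.data.table()` instead of `setDT()` so that `df` is not modified by reference; `dt`, `term_cols`, `sums` and `grid` are illustrative names.

```r
library(data.table)

# Example data from the question
df <- data.frame(user = 1:9,
                 Group1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
                 Group2 = c("d", "e", "d", "e", "d", "e", "e", "e", "e"),
                 term1 = c(0, 1, 1, 0, 1, 1, 0, 0, 0),
                 term2 = c(1, 0, 1, 1, 0, 1, 0, 1, 1),
                 term3 = c(0, 1, 0, 0, 0, 0, 1, 1, 0))

dt <- as.data.table(df)  # copy, so df itself stays a plain data.frame

# Assumption: every column to be summed starts with "term"
term_cols <- grep("^term", names(dt), value = TRUE)

# Sum only the term columns within each observed group
sums <- dt[, lapply(.SD, sum), by = .(Group1, Group2), .SDcols = term_cols]

# Full grid of Group1 x Group2 combinations
grid <- CJ(Group1 = unique(dt$Group1), Group2 = unique(dt$Group2))

# Keep every row of grid; combinations absent from sums come back as NA
res <- sums[grid, on = .(Group1, Group2)]

# Replace NA with 0 in the (numeric) term columns, by reference
setnafill(res, fill = 0, cols = term_cols)
res
```

Selecting columns through `.SDcols` also avoids relying on `user` being the first column, which `.SD[,-1]` does.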
      

      【Comments】:
