【Question title】: Calculating cumulative sum for multiple columns in R
【Posted】: 2020-12-29 18:40:30
【Question】:

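I'm new to R. I'm trying to calculate a cumulative sum grouped by year, month, group, and subgroup, and there are several columns to accumulate.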

Sample data:

df <- data.frame("Year"=2020,
                "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                "Group"=c("A","A","A","B","A","B","B","B"),
                "SubGroup"=c("a","a","b","b","a","b","a","b"),
                "V1"=c(10,10,20,20,50,50,10,10),
                "V2"=c(0,1,2,2,0,5,1,1))
    
       Year Month Group SubGroup V1 V2
    1 2020   Jan     A        a 10  0
    2 2020   Jan     A        a 10  1
    3 2020   Jan     A        b 20  2
    4 2020   Jan     B        b 20  2
    5 2020   Feb     A        a 50  0
    6 2020   Feb     B        b 50  5
    7 2020   Feb     B        a 10  1
    8 2020   Feb     B        b 10  1

Desired result table:

      Year Month Group SubGroup V1 V2
    1 2020   Jan     A        a 20  1
    2 2020   Feb     A        a 70  1
    3 2020   Jan     A        b 20  2
    4 2020   Feb     A        b 20  2
    5 2020   Jan     B        a  0  0
    6 2020   Feb     B        a 10  1
    7 2020   Jan     B        b 20  2
    8 2020   Feb     B        b 80  8

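In the sample table, for January 2020, Group "A" / SubGroup "a" sums to 10 + 10 = 20. In February 2020 the value is 50, so the running total is 20 (January) + 50 = 70, and so on.

Where a combination has no rows, it should count as 0.

I have tried several approaches, but none came close to the output I need. I would appreciate any help with this.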

【Question discussion】:

    Tags: r


    【Solution 1】:

    This is a simple group_by/mutate problem. Select columns V1 and V2 with across and apply cumsum to them.

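    library(dplyr)

    # Make Month a factor so that arrange() sorts Jan before Feb
    df$Month <- factor(df$Month, levels = c("Jan", "Feb"))
    
    df %>%
      group_by(Year, Group, SubGroup) %>%
      # cumsum() follows row order within each group; the sample data is already in month order
      mutate(across(V1:V2, ~cumsum(.x))) %>%
      ungroup() %>%
      arrange(Year, Group, SubGroup, Month)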
    ## A tibble: 8 x 6
    #  Year  Month Group SubGroup    V1    V2
    #  <chr> <fct> <chr> <chr>    <dbl> <dbl>
    #1 2020  Jan   A     a           10     0
    #2 2020  Jan   A     a           20     1
    #3 2020  Feb   A     a           70     1
    #4 2020  Jan   A     b           20     2
    #5 2020  Feb   B     a           10     1
    #6 2020  Jan   B     b           20     2
    #7 2020  Feb   B     b           70     7
    #8 2020  Feb   B     b           80     8
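
The output above keeps every input row, so a month with several rows shows intermediate running totals. If only the final running total per month is wanted, one possible follow-up (a sketch, not part of the original answer) is to keep the last row of each Year/Group/SubGroup/Month group with dplyr's slice_tail(). It assumes the rows are already in chronological order within each group, as they are in the sample data, and it does not add the zero rows for missing combinations.

```r
library(dplyr)

df <- data.frame(Year = 2020,
                 Month = c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                 Group = c("A","A","A","B","A","B","B","B"),
                 SubGroup = c("a","a","b","b","a","b","a","b"),
                 V1 = c(10,10,20,20,50,50,10,10),
                 V2 = c(0,1,2,2,0,5,1,1))
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

latest <- df %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, cumsum)) %>%         # running totals, row by row
  group_by(Year, Group, SubGroup, Month) %>%
  slice_tail(n = 1) %>%                     # keep the last row of each month
  ungroup() %>%
  arrange(Year, Group, SubGroup, Month)
```

With the sample data this leaves six rows, one per combination that actually occurs.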
    

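    【Comments】:

    • You might mention that this answer requires two packages: dplyr and magrittr (for the pipe operator %>%).
    • Thanks for the answer, Rui. I may be missing something, because I can't reproduce your output: it returns the original table, only reordered. What could be the cause? I have added the packages @CharlieGallagher mentioned. Also, I wasn't expecting rows 1 and 7, only the latest running total.
    【Solution 2】: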

    If I understand what you're doing, you are computing the sum for each month and then the cumulative sum across months. This is usually easy in dplyr.

    library(dplyr)
    
    df %>% 
      group_by(Year, Month, Group, SubGroup) %>% 
      summarize(
        V1_sum = sum(V1),
        V2_sum = sum(V2)
      ) %>% 
      group_by(Year, Group, SubGroup) %>% 
      mutate(
        V1_cumsum = cumsum(V1_sum),
        V2_cumsum = cumsum(V2_sum)
      )
    
    
    # A tibble: 6 x 8
    # Groups:   Year, Group, SubGroup [4]
    #   Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
    #   <dbl> <chr> <chr> <chr>     <dbl>  <dbl>     <dbl>     <dbl>
    # 1  2020 Feb   A     a            50      0        50         0
    # 2  2020 Feb   B     a            10      1        10         1
    # 3  2020 Feb   B     b            60      6        60         6
    # 4  2020 Jan   A     a            20      1        70         1
    # 5  2020 Jan   A     b            20      2        20         2
    # 6  2020 Jan   B     b            20      2        80         8
    

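    But you'll notice that the monthly cumulative sums run backwards (January comes after February), because by default group_by orders the groups alphabetically. Also, you don't see the empty combinations, because dplyr doesn't fill them in.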

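    To fix the month order, you can make the months numeric (or convert them to dates) or convert them to factors. You can add the "missing" combinations of the grouping variables by using aggregate from base R instead of dplyr::summarize. With drop = FALSE, aggregate includes every combination of the grouping factors. aggregate sets the missing values to NA, but you can replace NA with 0 using tidyr::replace_na, for example.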
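
The ordering issue can be seen in isolation with a few lines of base R. This is only an illustrative sketch: the vector `months` is made up for the example, and `month.abb` is the base R constant holding the English month abbreviations.

```r
months <- c("Jan", "Feb", "Jan", "Feb")

# Character values sort alphabetically, so "Feb" comes first
alphabetical <- sort(unique(months))

# A factor sorts by its levels, so "Jan" comes first
chronological <- levels(droplevels(factor(months, levels = month.abb)))
```

Using `month.abb` as the levels avoids listing the months by hand once more than two of them appear in the data.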

    library(dplyr)
    library(tidyr)
    
    df <- data.frame("Year"=2020,
                     "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                     "Group"=c("A","A","A","B","A","B","B","B"),
                     "SubGroup"=c("a","a","b","b","a","b","a","b"),
                     "V1"=c(10,10,20,20,50,50,10,10),
                     "V2"=c(0,1,2,2,0,5,1,1))
    
    df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)
    
    # Get monthly sums
    df1 <- with(df, aggregate(
      list(V1_sum = V1, V2_sum = V2),
      list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
      FUN = sum, drop = FALSE
    ))
    
    df1 <- df1 %>% 
      # Replace NA with 0
      mutate(
        V1_sum = replace_na(V1_sum, 0),
        V2_sum = replace_na(V2_sum, 0)
      ) %>% 
      # Get cumulative sum across months
      group_by(Year, Group, SubGroup) %>% 
      mutate(V1cumsum = cumsum(V1_sum), 
             V2cumsum = cumsum(V2_sum)) %>%
      ungroup() %>% 
      select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)
    

    This gives the same result as in your example:

    # # A tibble: 8 x 6
    #    Year Month Group SubGroup    V1    V2
    #    <dbl> <ord> <chr> <chr>    <dbl> <dbl>
    # 1  2020 Jan   A     a           20     1
    # 2  2020 Feb   A     a           70     1
    # 3  2020 Jan   B     a            0     0
    # 4  2020 Feb   B     a           10     1
    # 5  2020 Jan   A     b           20     2
    # 6  2020 Feb   A     b           20     2
    # 7  2020 Jan   B     b           20     2
    # 8  2020 Feb   B     b           80     8
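
As an alternative to aggregate(..., drop = FALSE), the missing combinations can probably also be filled in without leaving the tidyverse by using tidyr::complete(). The following is a sketch of that swapped-in technique, not the method used above; it assumes dplyr 1.0.0 or later for across() and the .groups argument.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Year = 2020,
                 Month = c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                 Group = c("A","A","A","B","A","B","B","B"),
                 SubGroup = c("a","a","b","b","a","b","a","b"),
                 V1 = c(10,10,20,20,50,50,10,10),
                 V2 = c(0,1,2,2,0,5,1,1))
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

result <- df %>%
  # monthly sums
  group_by(Year, Month, Group, SubGroup) %>%
  summarise(across(V1:V2, sum), .groups = "drop") %>%
  # add every missing Year/Month/Group/SubGroup combination, with 0 as the value
  complete(Year, Month, Group, SubGroup, fill = list(V1 = 0, V2 = 0)) %>%
  # put months in order before accumulating
  arrange(Year, Group, SubGroup, Month) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, cumsum)) %>%
  ungroup()
```

For the sample data this yields eight rows, sorted by Group and SubGroup and then Month, with the same values as the desired table in the question.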
    

    【Comments】:

      【Solution 3】:
      library(dplyr)
      library(zoo)
      
      df %>%
        arrange(as.yearmon(paste0(Year, '-', Month), '%Y-%b'), Group, SubGroup) %>%
        group_by(Year, Group, SubGroup) %>% 
        mutate(
               V1 = cumsum(V1),
               V2 = cumsum(V2)
             ) %>% 
        arrange(Year, Group, SubGroup, as.yearmon(paste0(Year, '-', Month), '%Y-%b')) #for desired output ordering
      
      #  A tibble: 8 x 6
      #  Groups:   Year, Group, SubGroup [4]
      #   Year  Month Group SubGroup    V1    V2
      #   <chr> <chr> <chr> <chr>    <dbl> <dbl>
      # 1 2020  Jan   A     a           10     0
      # 2 2020  Jan   A     a           20     1
      # 3 2020  Feb   A     a           70     1
      # 4 2020  Jan   A     b           20     2
      # 5 2020  Feb   B     a           10     1
      # 6 2020  Jan   B     b           20     2
      # 7 2020  Feb   B     b           70     7
      # 8 2020  Feb   B     b           80     8
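
If adding zoo is not an option, the same sort-then-accumulate idea can be sketched in base R. This is an assumption-laden variant rather than the code above: it relies on the months being English abbreviations that match the base constant `month.abb`, and the names `month_num` and `df_sorted` are made up for the example.

```r
df <- data.frame(Year = 2020,
                 Month = c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                 Group = c("A","A","A","B","A","B","B","B"),
                 SubGroup = c("a","a","b","b","a","b","a","b"),
                 V1 = c(10,10,20,20,50,50,10,10),
                 V2 = c(0,1,2,2,0,5,1,1))

# Map month abbreviations to 1..12 so they sort chronologically
month_num <- match(df$Month, month.abb)
df_sorted <- df[order(df$Year, month_num, df$Group, df$SubGroup), ]

# ave() applies cumsum within each Year/Group/SubGroup combination
for (col in c("V1", "V2")) {
  df_sorted[[col]] <- ave(df_sorted[[col]],
                          df_sorted$Year, df_sorted$Group, df_sorted$SubGroup,
                          FUN = cumsum)
}
```

Like the answer above, this keeps one row per input row, so months with several rows show intermediate totals.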
      

      【Comments】:
