【Question title】: Using dplyr for dynamic group_by
【Posted】: 2017-11-16 19:07:58
【Question description】:

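I'm trying to get my head around dplyr. I have a sorted data frame that I want to group based on one variable. However, the groups need to be constructed so that each group has a sum of at least 30 on the grouping variable.

Consider this small example data frame: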

df1 <- matrix(data = c( 5, 0.9, 95,
                       12, 0.8, 31,
                       16, 0.8, 28,
                       17, 0.7, 10,
                       23, 0.8, 11,
                       55, 0.6,  9,
                       56, 0.5, 12,
                       57, 0.2,  1,
                       59, 0.4,  1),
  ncol = 3,
  byrow = TRUE,
  dimnames = list(1:9,
    c('freq', 'mean', 'count')
  )
)

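Now I want to group the rows so that the sum of count in each group is at least 30. freq and mean should then each be collapsed into a weighted.mean, with the count values as weights. Note that the last "bin" reaches a sum of 32 at row 7, but since rows 8:9 only sum to 2, I add them to that last "bin", which brings its count to 34.

Like this: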

freq   mean   count
 5.00  0.90   95
12.00  0.80   31
16.26  0.77   38
45.18  0.61   34
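
As a quick check on the table above, its third row can be reproduced by collapsing input rows 3 and 4 (counts 28 and 10) with base R's weighted.mean. This is only a sketch of the arithmetic; the variable names are chosen for illustration.

```r
# Input rows 3 and 4 form one bin, since 28 + 10 = 38 >= 30
freq  <- c(16, 17)
avg   <- c(0.8, 0.7)    # the 'mean' column, renamed to avoid masking mean()
count <- c(28, 10)

bin_freq  <- weighted.mean(freq, count)  # (16*28 + 17*10) / 38
bin_mean  <- weighted.mean(avg, count)   # (0.8*28 + 0.7*10) / 38
bin_count <- sum(count)

# Rounded to two decimals: freq 16.26, mean 0.77, count 38
round(c(freq = bin_freq, mean = bin_mean, count = bin_count), 2)
```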

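Simple summaries with dplyr are no problem, but I can't figure this one out. I do think the solution is hidden in this question:

Dynamic Grouping in R | Grouping based on condition on applied function

But how to apply it to my situation escapes me.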

【Question comments】:

    Tags: r dplyr


    【Solution 1】:

    I wish I had a shorter solution, but here is what I came up with.

    First we define a custom cumsum function:

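    # Cumulative sum that restarts once the running total has reached 30
    cumsum2 <- function(x) {
      Reduce(function(.x, .y) {
        last <- tail(.x, 1)
        x1 <- if (last >= 30) 0 else last  # previous bin is full: start from 0
        c(.x, x1 + .y)
      }, x, 0)[-1]                         # drop the initial 0
    }
    # cumsum2(1:10)
    # [1]  1  3  6 10 15 21 28 36  9 19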
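
The same idea as cumsum2 can be written as an explicit loop, which may be easier to follow than the Reduce call. This is a minimal sketch: cumsum_reset and its threshold argument are illustrative names, and it restarts when the running total is at least the threshold, to match the "at least 30" requirement in the question.

```r
# Running sum that restarts after the running total has reached `threshold`.
# `cumsum_reset` and `threshold` are hypothetical names used for this sketch.
cumsum_reset <- function(x, threshold = 30) {
  out <- numeric(length(x))
  running <- 0
  for (i in seq_along(x)) {
    running <- running + x[i]
    out[i] <- running
    if (running >= threshold) running <- 0  # bin is full: start a new one
  }
  out
}

# Applied to the count column of the example data
cumsum_reset(c(95, 31, 28, 10, 11, 9, 12, 1, 1))
# [1] 95 31 28 38 11 20 32  1  2
```

Values of 30 or more mark the row that closes a bin; the trailing 1 and 2 belong to an incomplete bin that still has to be merged into the previous one.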
    

    Then we can build the dplyr chain:

    library(dplyr)
    library(tidyr)
    
    df1 %>%
      as.data.frame %>%                        # as you started with a matrix
      mutate(id = row_number(),                # we'll need this to sort in the end
             cumcount = cumsum2(count))    %>% # adding the new cumulative count
      mutate(cumcount = replace(cumcount, cumcount < 30, NA)) %>% # setting values less than 30 to NA ...
      fill(cumcount, .direction = "up")    %>% # ... in order to fill them with the cumcount that closes the group
      fill(cumcount, .direction = "down")  %>% # the last NAs belong to the last group, so we fill down too
      group_by(cumcount)                   %>% # these are our new groups to aggregate freq and mean
      summarize(id = min(id),
                freq = sum(freq*count)/sum(count),
                mean = sum(mean*count)/sum(count),
                count = sum(count))        %>% # group total; kept last because it overwrites count
      arrange(id)                          %>% # sort
      select(freq, mean, count)                # and lay out as expected output
    
    # # A tibble: 4 x 3
    #       freq      mean count
    #      <dbl>     <dbl> <dbl>
    # 1  5.00000 0.9000000    95
    # 2 12.00000 0.8000000    31
    # 3 16.26316 0.7736842    38
    # 4 45.17647 0.6117647    34
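
One caveat with grouping on the cumulative count itself: two different bins that happen to reach the same total would be merged into one group. A possible alternative, sketched below under the assumption that an explicit bin id is acceptable, assigns each row a bin number first. make_bins is a hypothetical helper written for this sketch, not a dplyr function, and the data frame is rebuilt here only to keep the example self-contained.

```r
library(dplyr)

df1 <- data.frame(
  freq  = c(5, 12, 16, 17, 23, 55, 56, 57, 59),
  mean  = c(0.9, 0.8, 0.8, 0.7, 0.8, 0.6, 0.5, 0.2, 0.4),
  count = c(95, 31, 28, 10, 11, 9, 12, 1, 1)
)

# Assign a bin id to each row; a bin closes once its counts sum to >= threshold.
# `make_bins` is a hypothetical helper for this sketch.
make_bins <- function(x, threshold = 30) {
  bin <- integer(length(x))
  running <- 0
  current <- 1L
  for (i in seq_along(x)) {
    bin[i] <- current
    running <- running + x[i]
    if (running >= threshold) {   # bin is full: the next row starts a new one
      running <- 0
      current <- current + 1L
    }
  }
  # Rows left in a trailing, incomplete bin are merged into the last full bin
  if (current > 1L) bin[bin == current] <- current - 1L
  bin
}

result <- df1 %>%
  mutate(bin = make_bins(count)) %>%
  group_by(bin) %>%
  summarise(freq  = weighted.mean(freq, count),
            mean  = weighted.mean(mean, count),
            count = sum(count)) %>%   # kept last because it overwrites count
  select(-bin)

result
```

For the example data, make_bins(df1$count) gives 1 2 3 3 4 4 4 4 4, and the last group has a count of 34 because rows 8 and 9 are folded into it.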
    

    【Comments】:
