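【Question title】: Aggregating if each observation can belong to multiple groups
【Posted】: 2018-05-23 08:35:15
【Question】:

I want to aggregate data by group. However, each observation can belong to multiple groups (e.g. observation 1 belongs to both group A and group B). I couldn't find a nice way to do this with data.table. Currently I create one logical variable per possible group, which is TRUE if the observation belongs to that group. I'm looking for a better way to do this than the one shown below. I'd also like to know how to achieve this with the tidyverse.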

library(data.table)
# Data
set.seed(1)
TF <- c(TRUE, FALSE)
time <- rep(1:4, each = 5)
df <- data.table(time = time, x = rnorm(20), groupA = sample(TF, size = 20, replace = TRUE),
                                             groupB = sample(TF, size = 20, replace = TRUE),
                                             groupC = sample(TF, size = 20, replace = TRUE))

# This should be nicer and less repetitive
df[groupA == TRUE, .(A = sum(x)), by = time][
  df[groupB == TRUE, .(B = sum(x)), by = time], on = "time"][
    df[groupC == TRUE, .(C = sum(x)), by = time], on = "time"]

# desired output
   time          A          B         C
1:    1         NA  0.9432955 0.1331984
2:    2  1.2257538  0.2427420 0.1882493
3:    3 -0.1992284 -0.1992284 1.9016244
4:    4  0.5327774  0.9438362 0.9276459

【Question comments】:

    Tags: r dplyr data.table tidyverse


    【Solution 1】:

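    Here is a solution with data.table. Multiplying each logical group column by x zeroes out the non-members, so a time/group combination with no members yields 0 rather than the NA shown in the question's desired output: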

    df[, lapply(.SD[, .(groupA, groupB, groupC)]*x, sum), time]
    #    time     groupA     groupB    groupC
    # 1:    1  0.0000000  0.9432955 0.1331984
    # 2:    2  1.2257538  0.2427420 0.1882493
    # 3:    3 -0.1992284 -0.1992284 1.9016244
    # 4:    4  0.5327774  0.9438362 0.9276459
    

    Or, more programmatically (thanks to @chinsoon12 for the comment):

    df[, lapply(.SD*x, sum), by=.(time), .SDcols=paste0("group", c("A","B","C"))]
    

    If you want the result in long format, you can do:

    df[, colSums(.SD*x), by=.(time), .SDcols=paste0("group", c("A","B","C"))]
    ### with indicator for the group:
    df[, .(colSums(.SD*x), c("A","B","C")), by=.(time), .SDcols=paste0("group", c("A","B","C"))] 
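
To keep the group label as a properly named column in the long result, one possible sketch is to name both elements inside `.()` and derive the labels from `names(.SD)`. The column names `group` and `total` and the helper variable `cols` are illustrative choices, not part of the original answer; the data setup repeats the question's.

```r
library(data.table)

# Same data setup as in the question
set.seed(1)
TF <- c(TRUE, FALSE)
df <- data.table(time = rep(1:4, each = 5), x = rnorm(20),
                 groupA = sample(TF, size = 20, replace = TRUE),
                 groupB = sample(TF, size = 20, replace = TRUE),
                 groupC = sample(TF, size = 20, replace = TRUE))

cols <- paste0("group", c("A", "B", "C"))  # hypothetical helper variable

# One row per (time, group): the label comes from the column names,
# the total is the sum of x over the members of that group
long <- df[, .(group = sub("^group", "", names(.SD)),
               total = colSums(.SD * x)),
           by = time, .SDcols = cols]
```

As with the wide version, a group with no members at a given time gets a total of 0 here, not NA.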
    

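    【Comments】:

    • +1 Sweet! An even more programmatic way: df[, lapply(.SD*x, sum), by=.(time), .SDcols=paste0("group", c("A","B","C"))]
    • Nice solution. However, if I run your code for the long-format output, the group variable is missing. How can I fix this?
    • @jogo Could you take a look at my follow-up question on this? stackoverflow.com/questions/50487229/…
    【Solution 2】: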

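    I think it is easier to work in long format here. First I gather the observations into long format and keep only the rows where the observation belongs to the respective group. Then I drop the logical column and rename the groups to single letters. Next I sum x by time and group (summarise in dplyr). Finally I spread the result back to wide format.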

    library(dplyr)
    library(tidyr)
    
    set.seed(1)
    TF <- c(TRUE, FALSE)
    time <- rep(1:4, each = 5)
    
    
    df <- data.frame(time = time, x = rnorm(20), groupA = sample(TF, size = 20, replace = TRUE),
                     groupB = sample(TF, size = 20, replace = TRUE),
                     groupC = sample(TF, size = 20, replace = TRUE))
    
    
    df %>% 
      gather(group, belongs, groupA:groupC) %>% 
      filter(belongs) %>% 
      select(-belongs) %>% 
      mutate(group = gsub("group", "", group)) %>% 
      group_by(time, group) %>% 
      summarise(x = sum(x)) %>% 
      spread(group, x)
    

    Output

    # A tibble: 4 x 4
    # Groups:   time [4]
       time       A      B     C
      <int>   <dbl>  <dbl> <dbl>
    1     1  NA      0.943 0.133
    2     2   1.23   0.243 0.188
    3     3  -0.199 -0.199 1.90 
    4     4   0.533  0.944 0.928
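
In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. A sketch of the same pipeline with the newer verbs follows, assuming tidyr >= 1.1.0 (for `names_sort`) and dplyr >= 1.0.0 (for `.groups`); the result name `res` is an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Same data setup as above
set.seed(1)
TF <- c(TRUE, FALSE)
df <- data.frame(time = rep(1:4, each = 5), x = rnorm(20),
                 groupA = sample(TF, size = 20, replace = TRUE),
                 groupB = sample(TF, size = 20, replace = TRUE),
                 groupC = sample(TF, size = 20, replace = TRUE))

res <- df %>%
  # Long format; names_prefix strips "group" so the labels are A, B, C
  pivot_longer(groupA:groupC, names_to = "group", values_to = "belongs",
               names_prefix = "group") %>%
  filter(belongs) %>%                        # keep actual memberships only
  group_by(time, group) %>%
  summarise(x = sum(x), .groups = "drop") %>%
  # Back to wide; combinations with no members become NA
  pivot_wider(names_from = group, values_from = x, names_sort = TRUE)
```

Because `sample()` changed its default algorithm in R 3.6.0, the numbers may differ from the output shown above even with the same seed.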
    

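    【Comments】:

      【Solution 3】:

      One option is to use the tidyr and dplyr packages in combination with data.table. Work with the data in long format first, then convert it to wide format.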

      library(data.table)  # provides melt() for the data.table df from the question
      library(dplyr)
      library(tidyr)
      
      melt(df, id.vars = c("time", "x")) %>%
        filter(value) %>%
        group_by(time, variable) %>%
        summarise(sum = sum(x)) %>%
        spread(variable, sum)
      
      # # A tibble: 4 x 4
      # # Groups: time [4]
      # time  groupA groupB groupC
      # * <int>   <dbl>  <dbl>  <dbl>
      # 1     1  NA      0.943  0.133
      # 2     2   1.23   0.243  0.188
      # 3     3  -0.199 -0.199  1.90 
      # 4     4   0.533  0.944  0.928
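
The same long-then-wide idea can also stay entirely within data.table by using `melt()` and `dcast()`. This is a sketch rather than part of the original answer; the names `long`, `agg`, `wide`, `belongs`, and `total` are illustrative.

```r
library(data.table)

# Same data setup as in the question
set.seed(1)
TF <- c(TRUE, FALSE)
df <- data.table(time = rep(1:4, each = 5), x = rnorm(20),
                 groupA = sample(TF, size = 20, replace = TRUE),
                 groupB = sample(TF, size = 20, replace = TRUE),
                 groupC = sample(TF, size = 20, replace = TRUE))

# Long format: one row per (observation, group) pair
long <- melt(df, id.vars = c("time", "x"),
             variable.name = "group", value.name = "belongs")

# Keep actual memberships only, then sum x per time and group
agg <- long[belongs == TRUE, .(total = sum(x)), by = .(time, group)]

# Back to wide; combinations with no members are left as NA
wide <- dcast(agg, time ~ group, value.var = "total")
```

Aggregating before `dcast()` means no `fun.aggregate` is needed, so empty time/group combinations stay NA, as in the question's desired output.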
      

      【Comments】:
