【Question title】:How to perform further grouping and lookups on aggregated values in .SD for a data.table
【Posted】:2015-07-08 21:35:02
【Question】:

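This is related to an earlier question about grouping and lookups on a data.table, but with additional output.

I am trying to group within the subset .SD and select values from each subset. For example, in the hflights dataset, I want to find, for each origin airport and month, the UniqueCarrier and the Dest with the worst mean ArrDelay. So there are essentially two levels of aggregation.

I have a working solution, shown below, but I would like to know whether there is a better one.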

library(data.table)
library(hflights)

DT <- as.data.table(hflights)

setkey(DT, Origin, Month)

#The solution code...
DT[, {
 t1 <- .SD[, .(mean(na.omit(ArrDelay))) , by=UniqueCarrier];
 max1 <- which.max(t1$V1);
 t2 <- .SD[, .(mean(na.omit(ArrDelay))) , by=Dest];
 max2 <- which.max(t2$V1);
 list( MaxAvgDelayForCarrier = t1$UniqueCarrier[max1], MaxAvgDelayByCarrier = t1$V1[max1],  MaxAvgDelayByDest= t2$Dest[max2], MaxAvgDelayForDest= t2$V1[max2] )
},  by = .(Origin, Month)]

# Checking for correctness
head(DT[ .("HOU", 1), .(MaxAvgDelayByCarrier=mean(na.omit(ArrDelay))), by=UniqueCarrier][order(-MaxAvgDelayByCarrier)],1)
head(DT[ .("IAH", 2), .(MaxAvgDelayForDest=mean(na.omit(ArrDelay))), by=Dest][order(-MaxAvgDelayForDest)],1)

    Tags: r data.table


    【Answer 1】:

    I think your code is fine, but I would write it like this:

    DT[,c(
    
      .SD[, 
        .(CMaxVal = mean(na.omit(ArrDelay))),
      by=.(CMax = UniqueCarrier)][which.max(CMaxVal)],
    
      .SD[, 
        .(DMaxVal = mean(na.omit(ArrDelay))),
      by=.(DMax = Dest)][which.max(DMaxVal)]
    
    ),by=key(DT)]
    

    This gives:

        Origin Month CMax   CMaxVal DMax  DMaxVal
     1:    HOU     1   F9 13.725806  PHL 20.12500
     2:    HOU     2   B6 17.822222  ECP 20.17308
     3:    HOU     3   EV 23.088889  PHL 46.06452
     4:    HOU     4   EV 27.847826  PHL 67.93333
     5:    HOU     5   EV 25.436620  PHL 75.61290
     6:    HOU     6   EV 16.930233  EWR 34.87755
     7:    HOU     7   B6 20.016129  CHS 21.54839
     8:    HOU     8   B6 30.163636  JFK 30.16364
     9:    HOU     9   DL 18.625000  EWR 14.32143
    10:    HOU    10   DL 17.803279  PHL 22.51613
    11:    HOU    11   F9  3.000000  EWR 18.46429
    12:    HOU    12   MQ 13.554502  EWR 28.17857
    13:    IAH     1   EV 15.682353  HNL 21.52632
    14:    IAH     2   MQ 19.946809  BPT 29.00000
    15:    IAH     3   AS 15.354839  SFO 27.43590
    16:    IAH     4   MQ 16.263441  SEA 22.48515
    17:    IAH     5   MQ 25.179104  DAY 25.96154
    18:    IAH     6   UA 24.453125  ANC 34.06667
    19:    IAH     7   OO 15.117419  DSM 32.39286
    20:    IAH     8   UA 17.297561  ANC 37.96552
    21:    IAH     9   UA 11.620000  SJU 16.76923
    22:    IAH    10   UA 11.601266  CID 16.88462
    23:    IAH    11   MQ  8.445545  CID 18.04167
    24:    IAH    12   XE 11.376852  HOB 25.95556
        Origin Month CMax   CMaxVal DMax  DMaxVal
    

    There is no need to store so many intermediate objects (t1, max1, etc.).
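
    The combined `j` expression works because `c()` applied to two one-row data.tables returns a plain named list, which data.table then treats as the output columns for the group. Here is a minimal sketch of that mechanism on made-up toy data; `toy` and its values are assumptions for illustration and do not come from hflights, and `mean(ArrDelay, na.rm = TRUE)` is used in place of `mean(na.omit(ArrDelay))`, which should be equivalent here.

```r
library(data.table)

# c() has no data.table method, so it falls back to list concatenation
a <- data.table(CMax = "Y", CMaxVal = 40)
b <- data.table(DMax = "Q", DMaxVal = 30)
combined <- c(a, b)  # plain list with four named elements

# Toy data (hypothetical values, chosen so that no group has a tie)
toy <- data.table(
  Origin        = c("A", "A", "A", "A", "B", "B", "B", "B"),
  UniqueCarrier = c("X", "X", "Y", "Y", "X", "Y", "Y", "Y"),
  Dest          = c("P", "Q", "P", "Q", "P", "P", "Q", "Q"),
  ArrDelay      = c(10, 20, NA, 40, 6, 1, 9, NA)
)

# Same pattern as the answer: two inner aggregations per outer group,
# each reduced to its single worst row, then concatenated into one row
res <- toy[, c(
  .SD[, .(CMaxVal = mean(ArrDelay, na.rm = TRUE)),
      by = .(CMax = UniqueCarrier)][which.max(CMaxVal)],
  .SD[, .(DMaxVal = mean(ArrDelay, na.rm = TRUE)),
      by = .(DMax = Dest)][which.max(DMaxVal)]
), by = Origin]

print(res)
```

    Note that `which.max` returns only the first maximum, so if two carriers or destinations tie within a group, only the first one encountered is reported.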


    The approach above requires hand-written code for each grouping variable. To avoid that, you could do this instead:

    DT[,do.call(c,lapply(c("UniqueCarrier","Dest"),
      function(colname) setNames(
        .SD[,mean(na.omit(ArrDelay)),by=c(colname)][which.max(V1)],
        paste0(c("max_","maxval_"),colname)
      )
    )),by=key(DT)]
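
    To see what the generalised version does step by step, here is a small sketch on made-up toy data (`toy` and its values are assumptions, not hflights). For each column name, the inner query produces a two-column table (the grouping column and the default `V1`), `which.max(V1)` keeps the worst row, `setNames` renames both columns, and `do.call(c, ...)` flattens the per-column results into one list per outer group. Wrapping `colname` in `c()` appears to be there so that `by` receives a character vector of column names rather than a bare symbol.

```r
library(data.table)

# Toy data (hypothetical values, chosen so that no group has a tie)
toy <- data.table(
  Origin        = c("A", "A", "A", "A", "B", "B", "B", "B"),
  UniqueCarrier = c("X", "X", "Y", "Y", "X", "Y", "Y", "Y"),
  Dest          = c("P", "Q", "P", "Q", "P", "P", "Q", "Q"),
  ArrDelay      = c(10, 20, NA, 40, 6, 1, 9, NA)
)

group_cols <- c("UniqueCarrier", "Dest")

res <- toy[, do.call(c, lapply(group_cols, function(colname) {
  # Mean delay per level of `colname`; the unnamed result column is V1
  best <- .SD[, mean(ArrDelay, na.rm = TRUE), by = c(colname)][which.max(V1)]
  # Rename to max_<col> (the level) and maxval_<col> (its mean delay)
  setNames(best, paste0(c("max_", "maxval_"), colname))
})), by = Origin]

print(res)
```

    Adding another grouping variable then only means appending its name to `group_cols`.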
    
