【Question title】: How can I compute statistics by decile groups in data.table
【Posted】: 2014-04-29 10:08:03
【Question】:

I have a data.table and want to compute statistics by group.

R) library(data.table)
R) set.seed(1)
R) DT=data.table(a=rnorm(100),b=rnorm(100))

The groups should be defined by the deciles of a:

R) quantile(DT$a,probs=seq(.1,.9,.1))
           10%            20%            30%            40%            50%            60%            70%            80%            90% 
-1.05265747329 -0.61386923071 -0.37534201964 -0.07670312896  0.11390916079  0.37707993057  0.58121734252  0.77125359976  1.18106507751 

How can I compute, say, the mean of b for each bin? For example, a row with a = -0.5 lies within [-0.61386923071, -0.37534201964], so it belongs to bin 3.

【Comments】:

    Tags: r data.table quantile


    【Answer 1】:

    How about:

    > DT[, mean(b), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
                    cut          V1
    1:               NA -0.31359818
    2:   (-1.05,-0.614] -0.14103182
    3:  (-0.614,-0.375] -0.33474492
    4: (-0.375,-0.0767]  0.20827735
    5:  (-0.0767,0.114]  0.14890251
    6:    (0.114,0.377]  0.16685304
    7:    (0.377,0.581]  0.07086979
    8:    (0.581,0.771]  0.17950572
    9:     (0.771,1.18] -0.04951607
    

    To look into that NA (it is worth checking the results anyway), I next ran:

    > DT[, list(mean(b),.N,list(a)), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
                    cut          V1  N                                                                                                                      V3
    1:               NA -0.31359818 20                1.59528080213779,1.51178116845085,-2.2146998871775,-1.98935169586337,-1.47075238389927,1.35867955152904,
    2:   (-1.05,-0.614] -0.14103182 10        -0.626453810742332,-0.835628612410047,-0.820468384118015,-0.621240580541804,-0.68875569454952,-0.70749515696212,
    3:  (-0.614,-0.375] -0.33474492 10        -0.47815005510862,-0.41499456329968,-0.394289953710349,-0.612026393250771,-0.443291873218433,-0.589520946188072,
    4: (-0.375,-0.0767]  0.20827735 10      -0.305388387156356,-0.155795506705329,-0.102787727342996,-0.164523596253587,-0.253361680136508,-0.112346212150228,
    5:  (-0.0767,0.114]  0.14890251 10 -0.0449336090152309,-0.0161902630989461,0.0745649833651906,-0.0561287395290008,-0.0538050405829051,-0.0593133967111857,
    6:    (0.114,0.377]  0.16685304 10             0.183643324222082,0.329507771815361,0.36458196213683,0.341119691424425,0.188792299514343,0.153253338211898,
    7:    (0.377,0.581]  0.07086979 10            0.487429052428485,0.575781351653492,0.389843236411431,0.417941560199702,0.387671611559369,0.556663198673657,
    8:    (0.581,0.771]  0.17950572 10             0.738324705129217,0.593901321217509,0.61982574789471,0.763175748457544,0.696963375404737,0.768532924515416,
    9:     (0.771,1.18] -0.04951607 10              1.12493091814311,0.943836210685299,0.821221195098089,0.918977371608218,0.782136300731067,1.10002537198388,
    

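    Aside: I returned a list column (each cell is itself a vector) to get a quick look at the values going into each bin, just as a check. data.table prints the cells comma-separated (and shows only the first 6 items of each), but every cell of V3 actually holds a numeric vector.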
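
As a rough illustration of that aside, the sketch below pulls the full vector back out of the list column for the NA group. The column names `bin` and `vals` and the variable `brks` are made up for this example, and `by=` is used instead of `keyby=` only to keep the sketch independent of how a given data.table version keys results that contain list columns.

```r
library(data.table)

set.seed(1)
DT <- data.table(a = rnorm(100), b = rnorm(100))

# Interior decile breaks only, as in the first call above
brks <- quantile(DT$a, probs = seq(.1, .9, .1))

# vals is a list column: one numeric vector per group
res <- DT[, list(mean_b = mean(b), .N, vals = list(a)),
          by = .(bin = cut(a, brks))]

# Extract the complete vector behind the NA group (not just the 6 printed items)
na_vals <- res[is.na(bin), vals][[1]]
range(na_vals)
```

Every value in `na_vals` should lie at or below the 10% break or above the 90% break, which is what the printed V3 column hints at.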

    So the values outside the first and last breaks are coded together as NA. It was not obvious to me how to tell cut not to do that, so I just added -Inf and +Inf:

    > DT[,list(mean(b),.N),keyby=cut(a,c(-Inf,quantile(a,probs=seq(.1,.9,.1)),+Inf))]
                     cut          V1  N
     1:     (-Inf,-1.05] -0.16938368 10
     2:   (-1.05,-0.614] -0.14103182 10
     3:  (-0.614,-0.375] -0.33474492 10
     4: (-0.375,-0.0767]  0.20827735 10
     5:  (-0.0767,0.114]  0.14890251 10
     6:    (0.114,0.377]  0.16685304 10
     7:    (0.377,0.581]  0.07086979 10
     8:    (0.581,0.771]  0.17950572 10
     9:     (0.771,1.18] -0.04951607 10
    10:      (1.18, Inf] -0.45781268 10
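
If numbered bins are wanted (the question speaks of "bin 3"), one option is to ask cut for integer codes with `labels = FALSE`. This is only a sketch of that variation, not part of the original answer; the column name `bin` and the variables `brks` and `bin_of_minus_half` are made up for illustration.

```r
library(data.table)

set.seed(1)
DT <- data.table(a = rnorm(100), b = rnorm(100))

# Same breaks as above: interior deciles padded with -Inf and +Inf
brks <- c(-Inf, quantile(DT$a, probs = seq(.1, .9, .1)), Inf)

# labels = FALSE makes cut() return integer bin codes 1..10 instead of a factor
DT[, bin := cut(a, brks, labels = FALSE)]
res <- DT[, .(mean_b = mean(b), .N), keyby = bin]

# Which bin would a = -0.5 fall into? Given the deciles printed in the
# question, this is the third interval, (-0.614, -0.375].
bin_of_minus_half <- cut(-0.5, brks, labels = FALSE)
```

The integer codes keep the natural ordering of the intervals, so `keyby = bin` still sorts the result from the lowest decile to the highest.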
    

    That's better. Alternatively:

    > DT[, list(mean(b),.N), keyby=cut(a,quantile(a,probs=seq(0,1,.1)),include=TRUE)]
                     cut          V1  N
     1:    [-2.21,-1.05] -0.16938368 10
     2:   (-1.05,-0.614] -0.14103182 10
     3:  (-0.614,-0.375] -0.33474492 10
     4: (-0.375,-0.0767]  0.20827735 10
     5:  (-0.0767,0.114]  0.14890251 10
     6:    (0.114,0.377]  0.16685304 10
     7:    (0.377,0.581]  0.07086979 10
     8:    (0.581,0.771]  0.17950572 10
     9:     (0.771,1.18] -0.04951607 10
    10:       (1.18,2.4] -0.45781268 10
    

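    This way you can see what the minimum and maximum are, rather than -Inf and +Inf. Note that you need to pass include=TRUE to cut (it partially matches the include.lowest argument); otherwise the minimum of a falls outside the first interval and 11 groups are returned, the first of which contains only 1 value.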
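
A toy example may make the effect of that argument easier to see. It uses base R only; the vectors `x` and `brks` are invented for the illustration.

```r
x <- c(1, 2, 3, 4)
brks <- c(1, 2.5, 4)   # the lowest break equals min(x)

# Default intervals are open on the left, so 1 is not inside (1, 2.5]
without_lowest <- cut(x, brks)

# include.lowest = TRUE closes the first interval on the left.
# (include=TRUE in the call above reaches this argument by partial matching.)
with_lowest <- cut(x, brks, include.lowest = TRUE)

is.na(without_lowest)   # only the first element is NA
is.na(with_lowest)      # no NA
```

With quantile breaks running from probs 0 to 1, the single value that would otherwise be lost is exactly the minimum of the column.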

    【Comments】:

    • I had missed that cut command! Cool, this is perfect... I had never used keyby either.
    • Why use keyby rather than by here?
    • @hadley To sort the bins. by= returns the groups in order of first appearance.
    • @MattDowle When I run DT[, list(mean(b),.N,list(a)), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))] I get Error in setkeyv(ans, names(ans)[seq_along(byval)]) Item 4 of list is not a vector. The other 3 commands work fine, though. If I run DT[, list(mean(b),.N,paste0(as.character(a),collapse=",")), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))], that works. I am running data.table 1.9.2 on R 3.0.1, Win7. Any idea what is going on here?
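
The by versus keyby point from the comments can be seen on a tiny table. This is a minimal sketch; the table `dt` and its columns `g` and `v` are invented for the illustration.

```r
library(data.table)

dt <- data.table(g = c("b", "a", "b", "c", "a"), v = 1:5)

# by= keeps groups in order of first appearance
res_by <- dt[, .(total = sum(v)), by = g]

# keyby= sorts the result by the grouping column and sets it as the key
res_keyby <- dt[, .(total = sum(v)), keyby = g]

res_by$g      # "b" "a" "c"
res_keyby$g   # "a" "b" "c"
key(res_keyby)
```

For factor bins produced by cut, sorting follows the factor levels, which is why the intervals come out in increasing order.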
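    【Answer 2】:

    I do this kind of thing a lot, so I wrote a very flexible bin_data() function for it in my R package, mltools. It is built entirely on data.table and makes use of the new non-equi joins.

    To answer your specific question, add Bin1 as a column of DT and then group by Bin1: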

    library(data.table)
    library(mltools)
    
    DT[, Bin1 := bin_data(vals=a, bins=seq(.1, .9, .1), binType="quantile")]
    DT[, list(mean(b)), keyby=Bin1]
    
                                            Bin1          V1
    1:                                        NA -0.31359818
    2:   [-1.05265747329296, -0.613869230708978) -0.14103182
    3:  [-0.613869230708978, -0.375342019639661) -0.33474492
    4: [-0.375342019639661, -0.0767031289639095)  0.20827735
    5:  [-0.0767031289639095, 0.113909160788544)  0.14890251
    6:    [0.113909160788544, 0.377079930573521)  0.16685304
    7:    [0.377079930573521, 0.581217342522697)  0.07086979
    8:    [0.581217342522697, 0.771253599758546)  0.17950572
    9:      [0.771253599758546, 1.1810650775142] -0.04951607
    

    You can also do other cool things.

    Make 10 equally spaced bins by quantile:

    DT[, Bin2 := bin_data(vals=a, bins=10, binType="quantile")]
    DT[, list(mean(b)), keyby=Bin2]
    
                                             Bin2          V1
     1:     [-2.2146998871775, -1.05265747329296) -0.16938368
     2:   [-1.05265747329296, -0.613869230708978) -0.14103182
     3:  [-0.613869230708978, -0.375342019639661) -0.33474492
     4: [-0.375342019639661, -0.0767031289639095)  0.20827735
     5:  [-0.0767031289639095, 0.113909160788544)  0.14890251
     6:    [0.113909160788544, 0.377079930573521)  0.16685304
     7:    [0.377079930573521, 0.581217342522697)  0.07086979
     8:    [0.581217342522697, 0.771253599758546)  0.17950572
     9:      [0.771253599758546, 1.1810650775142) -0.04951607
    10:       [1.1810650775142, 2.40161776050478] -0.45781268
    

    Make the last boundary left-closed, right-open as well:

    DT[, Bin3 := bin_data(vals=a, bins=10, binType="quantile", boundaryType="lcro)")]
    DT[, list(mean(b)), keyby=Bin3]
    
     1:                                        NA  0.42510038
     2:     [-2.2146998871775, -1.05265747329296) -0.16938368
     3:   [-1.05265747329296, -0.613869230708978) -0.14103182
     4:  [-0.613869230708978, -0.375342019639661) -0.33474492
     5: [-0.375342019639661, -0.0767031289639095)  0.20827735
     6:  [-0.0767031289639095, 0.113909160788544)  0.14890251
     7:    [0.113909160788544, 0.377079930573521)  0.16685304
     8:    [0.377079930573521, 0.581217342522697)  0.07086979
     9:    [0.581217342522697, 0.771253599758546)  0.17950572
    10:      [0.771253599758546, 1.1810650775142) -0.04951607
    11:       [1.1810650775142, 2.40161776050478) -0.55591413
    

    Specify your own explicit bins (note that empty bins are returned):

    bin_data(dt=DT, binCol="a", bins=seq(-5, 5, 1), returnDT=TRUE)
    
              Bin         a           b
      1: [-5, -4)        NA          NA
      2: [-4, -3)        NA          NA
      3: [-3, -2) -2.214700 -0.65069635
      4: [-2, -1) -1.989352 -0.17955653
      5: [-2, -1) -1.470752 -0.03763417
     ---                               
    100:   [1, 2)  1.586833 -1.20808279
    101:   [2, 3)  2.401618  0.42510038
    102:   [2, 3)  2.172612  0.20753834
    103:   [3, 4)        NA          NA
    104:   [4, 5]        NA          NA
    

    Use variable-sized bins:

    bin_data(dt=DT, binCol="a", bins=data.table(LB=c(-5, 0, 1), RB=c(0, 1, Inf)), returnDT=TRUE)
    
              Bin            a           b
      1:  [-5, 0) -0.626453811 -0.62036668
      2:  [-5, 0) -0.835628612 -0.91092165
      3:  [-5, 0) -0.820468384  1.76728727
      4:  [-5, 0) -0.305388387  1.68217608
      5:  [-5, 0) -0.621240581  1.43228224
     ---                               
     95: [1, Inf]  2.172611670  0.20753834
     96: [1, Inf]  1.178086997  0.21992480
     97: [1, Inf]  1.063099837  1.46458731
     98: [1, Inf]  1.207867806  0.40201178
     99: [1, Inf]  1.160402616 -0.73174817
    100: [1, Inf]  1.586833455 -1.20808279
              Bin            a           b
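
For readers who would rather not add a dependency, the LB/RB table above maps fairly directly onto a plain data.table non-equi join. The sketch below is not how mltools implements bin_data(); it is an illustration that assumes every interval is left-closed and right-open, and the `Bin` label format is invented here.

```r
library(data.table)

set.seed(1)
DT <- data.table(a = rnorm(100), b = rnorm(100))

# Explicit bins given as lower/upper bounds, as in the example above
bins <- data.table(LB = c(-5, 0, 1), RB = c(0, 1, Inf))
bins[, Bin := paste0("[", LB, ", ", RB, ")")]

# Non-equi update join: each row of DT receives the label of the bin
# whose bounds satisfy LB <= a < RB
DT[bins, on = .(a >= LB, a < RB), Bin := i.Bin]

# Aggregate per bin as before
res <- DT[, .(mean_b = mean(b), .N), keyby = Bin]
```

Rows that match no interval would keep NA in `Bin`, so checking `anyNA(DT$Bin)` afterwards is a cheap sanity test. Unlike the bin_data() output above, this version does not return rows for empty bins.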
    

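    【Comments】:

    • Thanks @statquant. I built it to meet a need that kept coming up. Note that it still needs some argument checking and error handling, but it works as long as you provide proper inputs.
    • I will use it and come back with suggestions and improvements. Typically I would add a by argument, so that you can subset first and then do the binning (I do this all the time, for example by year: 2014, 2015, ...).
    • @ben. Looks interesting. I will take a look at mltools.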