在 pandas 中的 groupby 对象上引导答案

【问题标题】：bootstrap on a groupby object in pandas在 pandas 中的 groupby 对象上引导
【发布时间】：2014-10-17 16:20:20
【问题描述】：

我有多个时间序列，其中观察值作为行（由时间戳索引，n > 5000）和变量作为列（n = 419）。我在时间序列中选择 N 百分比的包含，然后调用 groupby 以按年分组。我想要的是平均值、标准差，然后是每年的 95% 置信区间。使用下面的代码，我可以很容易地得到 mean 和 std，但我需要调用一个单独的引导函数来获得每年和每个组的 95% CI：

这是分组数据的外观：（2013 年有 86 行，28 列，数据始于 1970 年代）。我需要为分组中的每一年分组的每一列使用“bootsrap”。

for year, group in grouped:
print year
print group

2013
                  101        102        103        104       105        109     
2013-04-02    3162.84    4136.02   77124.56       0.00    973.18    9731.81   
2013-04-04    1033.81    5464.44   87283.30    3692.19   4282.94     295.37   
2013-04-04     640.75    4164.87  131033.14    2563.00   1121.31     961.12   
2013-04-10     246.87    4196.84   88380.57    4443.72    493.75    1234.37   
2013-04-13       0.00    8300.49  114291.42   10003.16    212.83    6385.00

` 以下是我的 groupby 和 bootstrap 函数：

def gbY_20pct(nm): # sort into 20% timeseries inclusion, groupby year, take mean for year
        nm1=nm.replace('0', np.nan) # remove 0 for logical count
        coun=nm1.count(axis=0,numeric_only=True)
        pct=(coun/len(nm1)) *100
        pCount=pct.loc[pct >= 20]
        nm1=nm.loc[:, pCount.index]
        grouped = nm1.groupby(nm1.index.map(lambda x: x.year))
        yrly=grouped.mean().astype(int)
        yrly_coun=grouped.count().astype(int)
        yrly_std=grouped.std().astype(int)
        yrly_max=grouped.max().astype(int)
        yrM1=yrly.join(yrly_std, lsuffix=' mean', rsuffix=' std', how='outer')
        yrM2=yrly_max.join(yrly_coun, lsuffix=' max', rsuffix=' count', how='outer')
        data=yrM1.join(yrM2, how='outer')
        return data

import numpy as np
import numpy.random as npr  
def bootstrap(data, num_samples, statistic, alpha):
    """Returns bootstrap estimate of 100.0*(1-alpha) CI for statistic."""
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha/2.0)*num_samples)],
            stat[int((1-alpha/2.0)*num_samples)])

为了测试代码，我是手动调用的（分组已经定义，函数没有关闭）

from bootstrap import bootstrap
low, high = bootstrap(grouped, 100000, np.mean, 0.05)
Traceback (most recent call last):

  File "<ipython-input-49-cd362c7908d1>", line 1, in <module>
    low, high = bootstrap(grouped, 100000, np.mean, 0.05)

  File "bootstrap.py", line 14, in bootstrap

  File "C:\Users\ryan.morse\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2991, in __getitem__
    bad_keys = list(set(key).difference(self.obj.columns))

TypeError: unhashable type: 'numpy.ndarray'

问题来自samples = data[idx] 行。我怀疑我需要比在引导程序中对数据字段使用“分组”更具体，但我不确定如何做到这一点。我需要将其作为 lambda 函数应用吗？或者也许有一个 for 循环？任何建议将不胜感激。

查看此页面后：Pandas, compute many means with bootstrap confidence intervals for plotting 并尝试了 scikit 引导函数 https://scikits.appspot.com/bootstrap，我测试了上面定义的函数，发现它的速度要快得多，结果可比。

编辑：

我认为这样的事情可能会起作用，但我仍然无法获得正确的语法：

groups=dict(list(grouped)) # this allows me to visualize the data and call values

for key, value in groups.iteritems():
low_i, high_i = bootstrap(groups.values(), 100000, np.mean, 0.05) 

Traceback (most recent call last):

  File "<ipython-input-36-7a8e261d656e>", line 2, in <module>
    low_i, high_i=bootstrap(groups.values(), 10000, np.mean, 0.05)

  File "<ipython-input-15-3ce4acd651dc>", line 7, in bootstrap
    samples = data[idx]

TypeError: only integer arrays with one element can be converted to an index

我不确定如何为引导函数调用“数据”，以及如何迭代所有年份并在所有年份中保持低位和高位（在同一个数据帧中或在 2 个单独的数据帧中）。

任何帮助将不胜感激...

编辑 2 我可以很容易地使用 lambda 函数，但是我似乎无法获得正确的输出：

for col, group in nm1.groupby(nm1.index.year):
    lo,hi=bootstrap(group,1000, np.mean, 0.05)

lo
Out[117]: 
array([ 0.05713616,  0.30724739,  0.39592714,  0.55113183,  0.68623155,
        0.69493923,  0.73513661,  0.84086099,  0.85882618,  0.86698939,
        0.99399694,  1.04415927,  1.06553914,  1.11306698,  1.15344871,
        1.27943327,  1.43275895,  1.81076036,  2.21647657,  2.37724615,
        2.39004626,  2.43154256,  2.89940325,  3.02234954,  3.30773642,
        3.96535146,  3.98973744,  4.38873853])

hi
Out[118]: 
array([ 0.20584822,  0.38832222,  0.42140066,  0.48615202,  0.59686031,
        0.67388397,  0.84269082,  0.84532503,  0.87078368,  0.9033272 ,
        0.90765817,  0.97523759,  0.99186096,  1.01668772,  1.06681722,
        1.18205259,  1.38524423,  1.79908484,  2.22314773,  2.33789105,
        2.5521743 ,  2.64242269,  2.88851233,  2.94387756,  3.44294791,
        3.63914938,  3.99185026,  4.36450246])

如果这行得通，我将在 33 年的每一年中对 28 列中的每一列都有 lo 和 hi，而不是我有一个似乎没有任何实际价值的有序数字数组......这是一个yrly 的 sn-p 包含每年的对数转换 groupby 均值，与上述数组不同，自举 CI 应该接近这些数字。

           101       102       103       104       105       109       135  
1978  3.416638  3.701268  3.828442  2.911944  2.687491  2.076515  1.232035   
1979  2.710939  3.172061  4.234109  1.666818  3.390646  1.355179  3.003813   
1980  2.652617  2.375495  3.316380  1.101594  2.220028  1.195449  1.998862   
1981  3.363424  3.485015  3.441784  2.242618  2.256745  1.719140  1.150454   
1982  2.791865  2.019883  4.093960  1.038226  2.106627  1.180935  2.456144   
1983  2.597307  2.213450  4.458691  1.274352  2.820910  1.705242  3.452762   
1984  3.042197  4.023952  3.816964  2.499883  2.445258  1.769485  1.690180   
1985  2.669850  2.162608  3.600731  1.400102  1.845218  1.234235  2.517108   
1986  3.597527  2.763436  2.790792  1.410343  2.116275  1.042812  1.528532

【问题讨论】：

标签： python pandas group-by statistics-bootstrap

【解决方案1】：

毕竟，我想出的答案是：

import scipy.stats
ci = grouped.aggregate(lambda x: scipy.stats.sem(x, ddof=1) * 1.96) #use mean +(-) ci to get 95% conf interval

事实证明，我真的不需要引导数据，因此我可以根据平均值的标准误差乘以正态分布的 0.975 分位数来估计 95% 的置信区间，假设数据是正常的分布式（但这是另一个问题......）。

           101       102       103       104       105       109       135
1978  0.230630  0.191651  0.168648  0.282588  0.237939  0.288924  0.257476   
1979  0.192579  0.147305  0.120740  0.225826  0.145646  0.266530  0.199315   
1980  0.189258  0.195263  0.182756  0.166479  0.166401  0.172550  0.189483   
1981  0.200727  0.169663  0.184478  0.232392  0.198591  0.230457  0.194084   
1982  0.271740  0.267881  0.164450  0.248718  0.246636  0.260973  0.253430   
1983  0.253495  0.279114  0.116744  0.266888  0.236672  0.317195  0.155766

【讨论】：