Pandas: Resample data by taking a weighted average of the top nth percentile of data within the resample window

【Question title】: Pandas: Resample data by taking a weighted average of the top nth percentile of data within the resample window
【Posted】: 2019-12-27 06:01:24
【Question】:

I have the following data:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = pd.Timestamp(1513393355.5, unit='s')
days = pd.date_range(date_today, date_today + timedelta(1), freq='s')

np.random.seed(seed=1111)
data_price = np.random.randint(2, high=10, size=len(days))
data_quantity = np.random.randint(2, high=100, size=len(days))

df = pd.DataFrame({'ts': days, 'price': data_price, 'quantity': data_quantity})
df = df.set_index('ts')
print(df.head())

                         price  quantity
ts                                      
2017-12-16 03:02:35.500      6        30
2017-12-16 03:02:36.500      9        18
2017-12-16 03:02:37.500      7        85
2017-12-16 03:02:38.500      3        51
2017-12-16 03:02:39.500      6        19

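I want to resample the data to 10-minute intervals, sort the observations within each 10-minute window by price in ascending order, take the top 20% of the data after sorting, and then compute the weighted average price (i.e. price weighted by quantity) as well as the sum of quantity over that top 20%.

There is a solution here that uses the groupby function to compute the weighted average price, but I want to apply the weighted average only to the top 20% of the data.

I would like to do this both on a static basis (i.e. using the pandas resample functionality) and on a rolling basis, evaluated every 1 minute with a 10-minute lookback period.

How can I do this elegantly with pandas? I am confused about how to rank within the resample window.

Thanks!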

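【Comments】:

Are you looking for rolling 10-minute intervals, or fixed ones that simply split the data into 10-minute chunks?
@calestini Actually, a solution for both would be appreciated. I will update the question.

Tags: pandas aggregate pandas-groupby resampling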

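【Solution 1】:

Here is an attempt. I used a 10-minute rolling window, so the current value represents whatever happened in the previous 10 minutes. For the demonstration I switched to 10 seconds so the calculation is easier to verify.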

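The logic is:

Filter the window down to the top 20% highest prices
Compute the weighted average of the filtered data (the sum of qty_pct * price)
Note: with 1 to 4 observations it keeps only the highest price; with 5 to 9 it still keeps one (len // 5 = 1); with 10 to 14 it keeps two; and so on.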
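
To make the arithmetic concrete, the selection and weighting steps above can be sketched on a single window, without any rolling machinery. The helper name `top_frac_wavg` is hypothetical and not part of the original answer; it assumes "top 20%" means the highest prices and uses the same `len // 5` rule with a minimum of one row.

```python
import numpy as np
import pandas as pd


def top_frac_wavg(prices, quantities):
    """Quantity-weighted mean price of the highest-priced 20% of rows.

    Hypothetical helper for illustration; always keeps at least one row.
    """
    prices = pd.Series(prices)
    quantities = pd.Series(quantities, index=prices.index)
    n_keep = max(1, len(prices) // 5)          # e.g. 10 rows -> 2 rows
    top_idx = prices.nlargest(n_keep).index    # labels of the highest prices
    return np.average(prices[top_idx], weights=quantities[top_idx])


# Prices and quantities of the first ten sample rows (seed 1111).
prices = [6, 9, 7, 3, 6, 4, 6, 2, 8, 6]
quantities = [30, 18, 85, 51, 19, 72, 47, 64, 21, 46]

result = top_frac_wavg(prices, quantities)
print(round(result, 6))  # (9*18 + 8*21) / (18 + 21) = 8.461538
```

The two kept rows are the prices 9 and 8, so the result is 330 / 39, which is the value that appears in the rolling output once the window holds ten observations.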

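Edit: I realized I was computing the top quantile rather than the top 20% of observations. The original version is kept below; here is the corrected version: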

def top_obs_wavg(s):
    ## keep the top 20% of observations by price (at least one)
    if len(s) < 5:  # not enough rows for 20%, keep the largest
        valid_index = s.nlargest(1).index
    else:
        valid_index = s.nlargest(len(s)//5).index  ## keep the top 20% of rows

    ## each kept row's share of the kept quantity (quantities come from the global df)
    pct_qty = df.loc[valid_index,'quantity']/np.sum(df.loc[valid_index,'quantity'])

    ## sum of quantity share * price = weighted average price
    return np.sum(pct_qty*s[valid_index])

df['t20_wavg'] = df.rolling('10s')['price'].apply(top_obs_wavg, raw=False)

Output:

                         price  quantity  t20_wavg
ts
2017-12-16 03:02:35.500      6        30  6.000000
2017-12-16 03:02:36.500      9        18  9.000000
2017-12-16 03:02:37.500      7        85  9.000000
2017-12-16 03:02:38.500      3        51  9.000000
2017-12-16 03:02:39.500      6        19  9.000000
2017-12-16 03:02:40.500      4        72  9.000000
2017-12-16 03:02:41.500      6        47  9.000000
2017-12-16 03:02:42.500      2        64  9.000000
2017-12-16 03:02:43.500      8        21  9.000000
2017-12-16 03:02:44.500      6        46  8.461538
2017-12-16 03:02:45.500      5        40  8.461538
2017-12-16 03:02:46.500      8        13  8.000000
2017-12-16 03:02:47.500      2        99  8.000000
2017-12-16 03:02:48.500      8        19  8.000000
2017-12-16 03:02:49.500      6        60  8.000000
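
The question also asks for the total quantity of the top 20%, which the function above does not return. A minimal sketch of one way to get it follows, reusing the same selection rule in a second rolling pass. The name `top_obs_qty` and the column `t20_qty` are made up for this example, and the small frame is rebuilt from the 15 rows printed above so that the block runs on its own.

```python
import pandas as pd

# First 15 rows of the sample data (values copied from the printed output).
idx = pd.date_range("2017-12-16 03:02:35.500", periods=15, freq="s")
df = pd.DataFrame(
    {
        "price": [6, 9, 7, 3, 6, 4, 6, 2, 8, 6, 5, 8, 2, 8, 6],
        "quantity": [30, 18, 85, 51, 19, 72, 47, 64, 21, 46, 40, 13, 99, 19, 60],
    },
    index=idx,
)


def top_obs_qty(s):
    # Same selection rule as top_obs_wavg: len // 5 rows, but at least one.
    n_keep = max(1, len(s) // 5)
    valid_index = s.nlargest(n_keep).index
    # Look the quantities up by timestamp in the enclosing frame.
    return df.loc[valid_index, "quantity"].sum()


df["t20_qty"] = df.rolling("10s")["price"].apply(top_obs_qty, raw=False)
print(df[["price", "quantity", "t20_qty"]])
```

Because `rolling(...).apply` hands the function only the price column, the timestamps in the window's index are what tie each price back to its quantity. This relies on the index being unique.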

Using the quantile (original version)

def top_quantile_wavg(s):
    ## keep prices at or above the quantile threshold
    ## note: s.quantile() defaults to q=0.5 (the median), which is what the
    ## output below reflects; pass q=0.8 for a top-20% threshold
    is_valid = s >= s.quantile()
    valid_index = s.index[is_valid]

    ## each kept row's share of the kept quantity (quantities come from the global df)
    pct_qty = df.loc[valid_index,'quantity']/np.sum(df.loc[valid_index,'quantity'])

    ## sum of quantity share * price = weighted average price
    return np.sum(pct_qty*s[valid_index])
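
The threshold used here depends on the argument passed to `Series.quantile`, whose default is `q=0.5`. The small check below, which is an illustration rather than part of the original answer, compares the default with `q=0.8` on a three-row window holding the prices 6, 9 and 7.

```python
import pandas as pd

s = pd.Series([6, 9, 7])  # prices of the first three sample rows

median_cut = s.quantile()     # default q=0.5 -> 7.0
top20_cut = s.quantile(0.8)   # about 8.2 (linear interpolation between 7 and 9)

kept_median = s[s >= median_cut].tolist()  # [9, 7]
kept_top20 = s[s >= top20_cut].tolist()    # [9]

print(median_cut, kept_median)
print(top20_cut, kept_top20)
```

With the default, both 9 and 7 pass the filter, which gives (9*18 + 7*85) / (18 + 85) = 7.349515 for that window. With `q=0.8` only the 9 would remain.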

Then we can use the pandas rolling class:

## change to '10min' ('10T' in older pandas versions) for 10 minutes
df['t20_wavg'] = df.rolling('10s')['price'].apply(top_quantile_wavg, raw=False)

Output:

                         price  quantity  t20_wavg
ts
2017-12-16 03:02:35.500      6        30  6.000000
2017-12-16 03:02:36.500      9        18  9.000000
2017-12-16 03:02:37.500      7        85  7.349515
2017-12-16 03:02:38.500      3        51  7.349515
2017-12-16 03:02:39.500      6        19  6.914474
2017-12-16 03:02:40.500      4        72  6.914474
2017-12-16 03:02:41.500      6        47  6.698492
2017-12-16 03:02:42.500      2        64  6.698492
2017-12-16 03:02:43.500      8        21  6.822727
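
For the static case raised in the question (fixed, non-overlapping 10-minute bins), one possible approach is to group on the index with `pd.Grouper` and apply a function that sees the whole sub-frame, so price and quantity are available together. This is a sketch under assumptions, not the answerer's method: `top20_stats` and its output column names are hypothetical, and "top 20%" is again taken to mean the highest prices.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question (same seed and sizes).
start = pd.Timestamp(1513393355.5, unit="s")
days = pd.date_range(start, start + pd.Timedelta(days=1), freq="s")
np.random.seed(seed=1111)
df = pd.DataFrame(
    {
        "price": np.random.randint(2, high=10, size=len(days)),
        "quantity": np.random.randint(2, high=100, size=len(days)),
    },
    index=days,
)


def top20_stats(window):
    """Weighted average price and total quantity of the top 20% by price."""
    if window.empty:  # guard for bins that contain no rows
        return pd.Series({"t20_wavg": np.nan, "t20_qty": np.nan})
    n_keep = max(1, len(window) // 5)
    top = window.nlargest(n_keep, "price")
    return pd.Series(
        {
            "t20_wavg": np.average(top["price"], weights=top["quantity"]),
            "t20_qty": top["quantity"].sum(),
        }
    )


# One row per fixed 10-minute bin.
static = df.groupby(pd.Grouper(freq="10min")).apply(top20_stats)
print(static.head())
```

Since the function receives a DataFrame instead of a single column, it does not need to reach back into a global frame the way the rolling version does.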

【Discussion】: