NumPy option for faster results:
Get the hourly volume into a NumPy array:
hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0)).sum()['qty'].to_numpy()
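Trim the array to a multiple of 24 hourly rows (whole days only). In my case that means dropping the 22 rows of the partial first day and the 16 rows of the partial last day: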
h = hourly_vol[22:-16]
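Now we have (24 * n) rows; split the data into rows of 24 values, one row per day:
a = h.reshape(-1, 24)  # h is already a NumPy array, so no .to_numpy() here
Get the total volume for each day: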
dsum = a.sum(axis=1)
Broadcast the daily totals to a 24-column array:
b = np.array([dsum]*24).transpose() # this may take a while
Get the result:
result = a/b
And flatten it so it can be inserted into the original dataframe:
result = result.reshape(240)
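Note: remember that in this case I dropped the first 22 and the last 16 rows at the start, so the result has to be inserted into the original dataframe at the matching rows:
df.loc[df.index[22:-16], 'result'] = result  # one .loc call; chained df.iloc[...]['result'] = ... writes to a copy (assumes df has a unique hourly index)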
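The NumPy steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the volumes are synthetic random numbers, the names `hourly`, `head`, `tail` and `share` are made up for the example, and the first and last days are assumed to be partial. Instead of hard-coding 22 and 16 it counts the rows of the partial days, and it uses `keepdims=True` so the per-day totals broadcast against the 24 columns without building the intermediate array `b`.

```python
import numpy as np
import pandas as pd

# Synthetic hourly volumes: 278 hours starting 2020-06-04 02:00 (values are made up).
idx = pd.date_range("2020-06-04 02:00", periods=278, freq="60min")
rng = np.random.default_rng(0)
hourly = pd.DataFrame({"qty": rng.uniform(100.0, 5000.0, size=len(idx))}, index=idx)

# Count the rows of the partial first and last days instead of hard-coding them.
days = hourly.index.normalize()
head = int((days == days[0]).sum())   # rows on the first (partial) day
tail = int((days == days[-1]).sum())  # rows on the last (partial) day

# One row per full day, 24 hourly values per row.
a = hourly["qty"].to_numpy()[head:-tail].reshape(-1, 24)

# (n, 24) / (n, 1): the daily totals broadcast across the 24 columns.
share = a / a.sum(axis=1, keepdims=True)

# Write back with a single .loc call; the trimmed rows stay NaN.
hourly["result"] = np.nan
hourly.loc[hourly.index[head:-tail], "result"] = share.ravel()
```

With this layout every full day's shares add up to 1, which is a cheap sanity check before using the column.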
Pandas solution (not suited to very large datasets):
Pandas short answer:
daily_vol = stock_df.groupby(pd.Grouper(freq='D', level=0)).sum()['qty']
hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0)).sum()['qty']
totals_col = daily_vol.reindex(pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")).ffill().bfill()  # reindex: recent pandas raises KeyError on missing labels with []
result = hourly_vol/totals_col
Explanation:
We start from tick data like this, which still needs a time index (example from binance.com BTC/USDT):
df.head(3):
          id    price       qty    quoteQty                     time  isBuyerMaker  isBestMatch  grouper  tick_rule    dollar_bt     abs_theta
0  334736000  9663.87  0.015233  147.209732  2020-06-04 02:37:29.688         False         True      0.0        0.0  -147.209732  2.557702e+08
1  334736001  9663.51  0.004417   42.683724  2020-06-04 02:37:29.805          True         True      0.0        0.0   -42.683724  2.557701e+08
2  334736002  9663.73  0.016810  162.447301  2020-06-04 02:37:29.813         False         True      0.0        1.0   162.447301  2.557703e+08
Set the time index:
df['time'] = pd.to_datetime(df['time'], unit='ms')
stock_df = df.set_index('time')
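As a small illustration of the indexing step, the sketch below assumes the raw `time` column holds integer epoch milliseconds (which is what `unit='ms'` expects). The two rows and the names `raw` and `ticks` are made up for the example; only the `qty` values are borrowed from the sample above.

```python
import pandas as pd

# Two made-up trades with epoch-millisecond timestamps (assumed raw format).
raw = pd.DataFrame({
    "time": [1591238249688, 1591238249805],
    "qty": [0.015233, 0.004417],
})

# Convert the integers to timestamps, then use them as a sorted index
# so that time-based grouping works on the frame.
raw["time"] = pd.to_datetime(raw["time"], unit="ms")
ticks = raw.set_index("time").sort_index()
```

Sorting is not strictly required for grouping, but a monotonic index makes later time-based slicing predictable.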
Total volume per day:
daily_vol = stock_df.groupby(pd.Grouper(freq='D', level=0)).sum()['qty']
time
2020-06-04 53696.704657
2020-06-05 47788.050050
2020-06-06 32752.950893
2020-06-07 57952.848385
2020-06-08 40664.664125
2020-06-09 46024.001289
2020-06-10 47130.762982
2020-06-11 94418.984730
2020-06-12 50119.066932
2020-06-13 27759.784851
2020-06-14 30055.506608
2020-06-15 57688.820941
Freq: D, Name: qty, dtype: float64
Total volume per hour:
hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0)).sum()['qty']
time
2020-06-04 02:00:00 447.253335
2020-06-04 03:00:00 1631.115302
2020-06-04 04:00:00 1703.933586
2020-06-04 05:00:00 1165.990115
2020-06-04 06:00:00 1441.345409
...
2020-06-15 11:00:00 2492.983349
2020-06-15 12:00:00 1971.762135
2020-06-15 13:00:00 3724.376480
2020-06-15 14:00:00 4531.290738
2020-06-15 15:00:00 811.775574
Freq: H, Name: qty, Length: 278, dtype: float64
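For comparison, here is a hedged sketch of the same hourly aggregation on synthetic ticks, written once with `pd.Grouper` and once with `resample`. The data and the names `ticks`, `by_grouper` and `by_resample` are invented for the example. Selecting the `qty` column before calling `sum()` avoids summing columns that are not needed (or not numeric).

```python
import numpy as np
import pandas as pd

# Synthetic ticks every 7 minutes (made-up values, not real market data).
idx = pd.date_range("2020-06-04 02:37", periods=500, freq="7min")
ticks = pd.DataFrame(
    {"qty": np.linspace(0.01, 0.5, len(idx)), "price": 9663.0},
    index=idx,
)

# Hourly volume via a Grouper on the index, as in the answer...
by_grouper = ticks.groupby(pd.Grouper(freq="60min", level=0))["qty"].sum()

# ...and the same aggregation via resample on the selected column.
by_resample = ticks["qty"].resample("60min").sum()
```

Both produce one row per hour bin, labelled by the start of the hour.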
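To get each hour's share of its day, every hourly row needs the daily total next to it for the division that follows. Note that the back-fill gives the rows of the partial first day (2020-06-04) the following day's total, 47788.050050, rather than their own 53696.704657, as the output below shows:
totals_col = daily_vol.reindex(pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")).ffill().bfill()  # reindex: recent pandas raises KeyError on missing labels with []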
2020-06-04 02:00:00 47788.050050
2020-06-04 03:00:00 47788.050050
2020-06-04 04:00:00 47788.050050
2020-06-04 05:00:00 47788.050050
2020-06-04 06:00:00 47788.050050
...
2020-06-15 11:00:00 57688.820941
2020-06-15 12:00:00 57688.820941
2020-06-15 13:00:00 57688.820941
2020-06-15 14:00:00 57688.820941
2020-06-15 15:00:00 57688.820941
Freq: 60T, Name: qty, Length: 278, dtype: float64
Now each hour's share of the day's volume can be computed:
hourly_vol/totals_col
time
2020-06-04 02:00:00 0.009359
2020-06-04 03:00:00 0.034132
2020-06-04 04:00:00 0.035656
2020-06-04 05:00:00 0.024399
2020-06-04 06:00:00 0.030161
...
2020-06-15 11:00:00 0.043214
2020-06-15 12:00:00 0.034179
2020-06-15 13:00:00 0.064560
2020-06-15 14:00:00 0.078547
2020-06-15 15:00:00 0.014072
Freq: H, Name: qty, Length: 278, dtype: float64
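A possible alternative, sketched here on synthetic data rather than taken from the answer, is to let `groupby(...).transform("sum")` place each day's own total on every hourly row. This avoids the hard-coded date range and the forward/back fill, so the hours of a partial day are divided by that day's own total. The names `day_total` and `pct` and the random volumes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic hourly volumes covering the same span as the example (values made up).
idx = pd.date_range("2020-06-04 02:00", periods=278, freq="60min")
rng = np.random.default_rng(1)
hourly_vol = pd.Series(rng.uniform(100.0, 5000.0, size=len(idx)), index=idx, name="qty")

# Group by calendar day; transform("sum") returns the group total
# aligned to every original hourly row.
day_total = hourly_vol.groupby(hourly_vol.index.normalize()).transform("sum")

# Share of each hour within its own day.
pct = hourly_vol / day_total
```

Because each row is divided by the total of its own day, the shares within any day sum to 1, including the partial first and last days.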