【Question title】: Resample dataframe with specific start/end dates along with a groupby
【Posted】: 2019-07-29 16:42:16
【Question】:

I have some transaction data that looks like this.

import pandas as pd
from io import StringIO
from datetime import datetime
from datetime import timedelta

data = """\
cust_id,datetime,txn_type,txn_amt
100,2019-03-05 6:30,Credit,25000
100,2019-03-06 7:42,Debit,4000
100,2019-03-07 8:54,Debit,1000
101,2019-03-05 5:32,Credit,25000
101,2019-03-06 7:13,Debit,5000
101,2019-03-06 8:54,Debit,2000
"""

df = pd.read_csv(StringIO(data))
# the timestamps have no seconds field, so the format stops at minutes
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
# use datetime as the dataframe index
df = df.set_index('datetime')
print(df)

                    cust_id txn_type  txn_amt
datetime                                      
2019-03-05 06:30:00      100   Credit    25000
2019-03-06 07:42:00      100    Debit     4000
2019-03-07 08:54:00      100    Debit     1000
2019-03-05 05:32:00      101   Credit    25000
2019-03-06 07:13:00      101    Debit     5000
2019-03-06 08:54:00      101    Debit     2000

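I want to resample the data so that txn_amt is aggregated (summed) at the daily level for each combination of cust_id and txn_type. At the same time, I want to normalize the index to 5 days (the data currently covers only 3 days). Essentially, this is what I want to produce: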

             cust_id txn_type  txn_amt
datetime    
2019-03-03    100    Credit   0
2019-03-03    100    Debit    0
2019-03-03    101    Credit   0
2019-03-03    101    Debit    0
2019-03-04    100    Credit   0
2019-03-04    100    Debit    0
2019-03-04    101    Credit   0
2019-03-04    101    Debit    0
2019-03-05    100    Credit   25000
2019-03-05    100    Debit    0
2019-03-05    101    Credit   25000
2019-03-05    101    Debit    0
2019-03-06    100    Credit   0
2019-03-06    100    Debit    4000
2019-03-06    101    Credit   0
2019-03-06    101    Debit    7000   => (note: aggregated value)
2019-03-07    100    Credit   0
2019-03-07    100    Debit    1000
2019-03-07    101    Credit   0
2019-03-07    101    Debit    0

So far, I have tried creating a new datetime index, with the idea of resampling and then reindexing against it, like this:

# create a 5 day datetime index
end_dt = max(df.index).to_pydatetime().strftime('%Y-%m-%d')
start_dt = max(df.index) - timedelta(days=4)
start_dt = start_dt.to_pydatetime().strftime('%Y-%m-%d')
dt_index = pd.date_range(start=start_dt, end=end_dt, freq='1D', name='datetime')
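The same five-day index can probably be built without the round trip through strings. A minimal sketch, assuming the index is a DatetimeIndex as above; `last_ts` is a hypothetical stand-in for `max(df.index)` so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for max(df.index) from the data above (assumption for illustration).
last_ts = pd.Timestamp("2019-03-07 08:54")

# normalize() drops the time of day; periods=5 counts back from the end date.
dt_index = pd.date_range(end=last_ts.normalize(), periods=5, freq="D", name="datetime")
print(dt_index)
```

Passing `end` together with `periods` lets pandas work out the start date, so no `timedelta` arithmetic is needed.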

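However, I am not sure how to do the grouping part. Resampling without grouping gives the wrong result, because cust_id gets summed as if it were a value and txn_type is lost: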

# resample timeseries so that one row is 1 day's worth of txns
df2 = df.resample(rule='D').sum().reindex(dt_index).fillna(0)
print(df2)
            cust_id  txn_amt
datetime                    
2019-03-03      0.0      0.0
2019-03-04      0.0      0.0
2019-03-05    201.0  50000.0
2019-03-06    302.0  11000.0
2019-03-07    100.0   1000.0

So, how can I incorporate the grouping by cust_id and txn_type when resampling? I have seen this similar question, but the OP's data has a different structure.

【Comments】:

    Tags: pandas pandas-groupby


    【Solution 1】:

    I use reindex here; the key is to build a MultiIndex covering every date, cust_id and txn_type combination.

    # drop the time of day but keep a DatetimeIndex so it matches date_range below
    df.index = pd.to_datetime(df.index).normalize()
    # daily sum for every (day, cust_id, txn_type) combination present in the data
    daily = df.groupby([df.index, 'cust_id', 'txn_type']).agg({'txn_amt': 'sum'})
    drange = pd.date_range(end=df.index.max(), periods=5)
    # naming the levels avoids the default level_1 / level_2 column names
    idx = pd.MultiIndex.from_product([drange, df['cust_id'].unique(), df['txn_type'].unique()],
                                     names=['datetime', 'cust_id', 'txn_type'])
    Newdf = daily.reindex(idx, fill_value=0).reset_index(level=['cust_id', 'txn_type'])
    Newdf
    Out[749]: 
                cust_id txn_type  txn_amt
    datetime
    2019-03-03      100   Credit        0
    2019-03-03      100    Debit        0
    2019-03-03      101   Credit        0
    2019-03-03      101    Debit        0
    2019-03-04      100   Credit        0
    2019-03-04      100    Debit        0
    2019-03-04      101   Credit        0
    2019-03-04      101    Debit        0
    2019-03-05      100   Credit    25000
    2019-03-05      100    Debit        0
    2019-03-05      101   Credit    25000
    2019-03-05      101    Debit        0
    2019-03-06      100   Credit        0
    2019-03-06      100    Debit     4000
    2019-03-06      101   Credit        0
    2019-03-06      101    Debit     7000
    2019-03-07      100   Credit        0
    2019-03-07      100    Debit     1000
    2019-03-07      101   Credit        0
    2019-03-07      101    Debit        0
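
    The approach above can also be sketched as a self-contained script. This is only an illustration under a few assumptions: the sample data is copied from the question with the hours zero-padded, `pd.Grouper` is swapped in for the manual date truncation, and `daily_totals` and its `periods` parameter are hypothetical names introduced here, not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Sample data from the question (hours zero-padded here for unambiguous parsing).
data = """\
cust_id,datetime,txn_type,txn_amt
100,2019-03-05 06:30,Credit,25000
100,2019-03-06 07:42,Debit,4000
100,2019-03-07 08:54,Debit,1000
101,2019-03-05 05:32,Credit,25000
101,2019-03-06 07:13,Debit,5000
101,2019-03-06 08:54,Debit,2000
"""
df = pd.read_csv(StringIO(data), parse_dates=["datetime"])


def daily_totals(frame, periods=5):
    """Sum txn_amt per day, cust_id and txn_type over the last `periods` days."""
    # Daily sums for the (day, cust_id, txn_type) groups produced by the groupby.
    daily = frame.groupby(
        [pd.Grouper(key="datetime", freq="D"), "cust_id", "txn_type"]
    )["txn_amt"].sum()

    # Every day in the window crossed with every customer and transaction type.
    last_day = daily.index.get_level_values("datetime").max()
    days = pd.date_range(end=last_day, periods=periods, freq="D")
    full = pd.MultiIndex.from_product(
        [days, frame["cust_id"].unique(), frame["txn_type"].unique()],
        names=["datetime", "cust_id", "txn_type"],
    )

    # Combinations with no transactions are filled with 0.
    return daily.reindex(full, fill_value=0).reset_index(["cust_id", "txn_type"])


result = daily_totals(df)
print(result)
```

    Reindexing against the full product of the three levels is what creates the zero rows; the groupby alone is not relied on to produce them.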
    

    【Comments】:
