[Question title]: pandas group data at 3 month intervals and aggregate list of functions
[Posted]: 2022-11-14 16:52:14
[Question description]:

I have a dataframe as shown below

import pandas as pd

df = pd.DataFrame({'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
                   'invoice_id':[1,2,3,4,5,6,7,8,9,10,11,12],
                   'purchase_date' :['2017-04-03 12:35:00','2017-04-03 12:50:00','2018-04-05 12:59:00','2018-05-04 13:14:00','2017-05-05 13:37:00','2018-07-06 13:39:00','2018-07-08 11:30:00','2017-04-08 16:00:00','2019-04-09 22:00:00','2019-04-11 04:00:00','2018-04-13 04:30:00','2017-04-14 08:00:00'],
                   'val' :[5,5,5,5,1,6,5,5,8,3,4,6],
                   'Prod_id':['A1','A1','C1','A1','E1','Q1','G1','F1','G1','H1','J1','A1']})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

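I would like to do the following:

a) Group the data by subject_id at 3-month intervals (using the purchase_date column)

b) Compute statistics such as mean, sum, nunique and count for each group, based on the other variables such as Prod_id, val, etc.

For example: the earliest purchase date in df is 2017-04-03, so the starting month of the dataset is April. We therefore count 3 months from April: Apr, May and Jun are M1; Jul, Aug and Sep are M2; and so on. This gives the data at 3-month intervals. Whenever a 3-month interval contains no data, we set it to zero (0).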
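The numbering rule described above can be sketched as follows. This is only an illustration under stated assumptions: the four sample dates and the names `month_no`, `bucket` and `labels` are made up for the example and are not part of the question's data.

```python
import pandas as pd

# Hypothetical sample dates, chosen to cover a bucket boundary
dates = pd.to_datetime(pd.Series([
    "2017-04-03", "2017-06-30", "2017-07-01", "2018-04-05",
]))

# Running month number, so that subtraction gives calendar-month differences
month_no = dates.dt.year * 12 + dates.dt.month

# Bucket 1 starts at the earliest month in the data; every 3 months starts a new bucket
bucket = (month_no - month_no.min()) // 3 + 1
labels = "M" + bucket.astype(str)

print(labels.tolist())  # ['M1', 'M1', 'M2', 'M5']
```

Because the buckets are anchored at the earliest month in the data rather than at calendar quarters, April 2018 lands in M5, twelve months (four buckets) after April 2017.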

So I tried something like the following, based on online research:

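    length_of_purchase_date = 10

    # Map each unique purchase date (YYYY-MM-DD) to a label M_1, M_2, ... (latest date first)
    date_month_map = {
        str(x)[:length_of_purchase_date]: 'M_%s' % (i+1) for i, x in enumerate(
            sorted(df.reset_index()['purchase_date'].unique(), reverse=True)
        )
    }

    # Group by subject and by 3-month bins of purchase_date, then aggregate val
    df.reset_index().groupby(['subject_id',
        pd.Grouper(key='purchase_date', freq='3M')
    ]).agg({
        'val': ['sum', 'mean', 'count'],
    })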

I would like my output to look like the following (shown for subject_id = 1). Note that I have to do this on large data with millions of rows.

[Question discussion]:

    Tags: python pandas dataframe group-by time-series


    [Solution 1]:

    Use:

    df = df.sort_values(['subject_id','purchase_date'])
    
    # Month ordinal of each purchase; 3-month groups are counted from the earliest month
    per = df['purchase_date'].dt.to_period('M').astype('int64')
    df['date_group'] = (per.sub(per.min()) // 3 + 1)
    
    
    # Most frequent value (the first one in sorted order on ties)
    f = lambda x: x.mode().iat[0]
    df = df.groupby(['subject_id', 'date_group']).agg(max_date=('purchase_date','max'),
                                                      nunique=('Prod_id','nunique'),
                                                      count_prod_id=('Prod_id','count'),
                                                      sum_val=('val','sum'),
                                                      avg_val=('val','mean'),
                                                      min_val=('val','min'),
                                                      max_val=('val','max'),
                                                      Top1st_prod_id=('Prod_id',f))
    # Numeric columns to fill with 0 in date groups that have no purchases
    d = dict.fromkeys(df.columns.difference(['max_date','Top1st_prod_id']), 0)
    # Add the missing date groups, from 1 to the last observed group of each subject
    df = (df.reset_index(level=0)
             .groupby('subject_id')
             .apply(lambda x: x.reindex(range(1, x.index.max() + 1)))
             .fillna(d)) 
    
    df['max_date'] = df['max_date'].dt.strftime('%d-%b-%y')
    
    

    print (df)
                           subject_id   max_date  nunique  count_prod_id  sum_val  
    subject_id date_group                                                           
    1          1                  1.0  05-May-17      2.0            3.0     11.0   
               2                  NaN        NaN      0.0            0.0      0.0   
               3                  NaN        NaN      0.0            0.0      0.0   
               4                  NaN        NaN      0.0            0.0      0.0   
               5                  1.0  04-May-18      2.0            2.0     10.0   
               6                  1.0  08-Jul-18      2.0            2.0     11.0   
    2          1                  2.0  14-Apr-17      2.0            2.0     11.0   
               2                  NaN        NaN      0.0            0.0      0.0   
               3                  NaN        NaN      0.0            0.0      0.0   
               4                  NaN        NaN      0.0            0.0      0.0   
               5                  2.0  13-Apr-18      1.0            1.0      4.0   
               6                  NaN        NaN      0.0            0.0      0.0   
               7                  NaN        NaN      0.0            0.0      0.0   
               8                  NaN        NaN      0.0            0.0      0.0   
               9                  2.0  11-Apr-19      2.0            2.0     11.0   
    
                            avg_val  min_val  max_val Top1st_prod_id  
    subject_id date_group                                             
    1          1           3.666667      1.0      5.0             A1  
               2           0.000000      0.0      0.0            NaN  
               3           0.000000      0.0      0.0            NaN  
               4           0.000000      0.0      0.0            NaN  
               5           5.000000      5.0      5.0             A1  
               6           5.500000      5.0      6.0             G1  
    2          1           5.500000      5.0      6.0             A1  
               2           0.000000      0.0      0.0            NaN  
               3           0.000000      0.0      0.0            NaN  
               4           0.000000      0.0      0.0            NaN  
               5           4.000000      4.0      4.0             J1  
               6           0.000000      0.0      0.0            NaN  
               7           0.000000      0.0      0.0            NaN  
               8           0.000000      0.0      0.0            NaN  
               9           5.500000      3.0      8.0             G1  
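
In the output above, `subject_id` also appears as an extra column that is NaN in the filled rows, and the counts become floats. As a possible variation (a sketch, not the answerer's method), the missing date groups can be filled by reindexing against an explicit `MultiIndex`. The small sample frame and the names `agg`, `last_group`, `full_index` and `filled` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical reduced sample, not the question's full data
df = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2017-04-03", "2017-05-05", "2018-04-05", "2017-04-08", "2017-10-13"]
    ),
    "val": [5, 1, 5, 5, 4],
})

# 3-month group number, counted from the earliest month in the data
month_no = df["purchase_date"].dt.year * 12 + df["purchase_date"].dt.month
df["date_group"] = ((month_no - month_no.min()) // 3 + 1).astype("int64")

agg = df.groupby(["subject_id", "date_group"]).agg(
    count_val=("val", "count"),
    sum_val=("val", "sum"),
)

# Full index: for each subject, every group from 1 to its last observed group
last_group = agg.reset_index().groupby("subject_id")["date_group"].max()
full_index = pd.MultiIndex.from_tuples(
    [(int(sid), g) for sid, last in last_group.items() for g in range(1, int(last) + 1)],
    names=["subject_id", "date_group"],
)

# Rows missing from agg are created and filled with 0; integer dtypes are kept
filled = agg.reindex(full_index, fill_value=0)
print(filled)
```

The Python-level loop runs over subjects and groups, not over the original rows, so it should stay small relative to data with millions of rows. Non-numeric columns such as a date or a product id would need a different fill value than 0.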
    

    [Discussion]:
