【Question title】: Pandas: Average value for the past n days
【Posted】: 2016-08-26 10:26:35
【Question】:

I have a pandas DataFrame like this:

import pandas as pd

test = pd.DataFrame({ 'Date' : ['2016-04-01','2016-04-01','2016-04-02',
                             '2016-04-02','2016-04-03','2016-04-04',
                             '2016-04-05','2016-04-06','2016-04-06'],
                      'User' : ['Mike','John','Mike','John','Mike','Mike',
                             'Mike','Mike','John'],
                      'Value' : [1,2,1,3,4.5,1,2,3,6]
                })

As shown below, the dataset does not necessarily have an observation for every day:

         Date  User  Value
0  2016-04-01  Mike    1.0
1  2016-04-01  John    2.0
2  2016-04-02  Mike    1.0
3  2016-04-02  John    3.0
4  2016-04-03  Mike    4.5
5  2016-04-04  Mike    1.0
6  2016-04-05  Mike    2.0
7  2016-04-06  Mike    3.0
8  2016-04-06  John    6.0

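I want to add a new column that shows, for each user, the average value over the past n days (n = 2 in this example), provided at least one of those days is available; otherwise it should be NaN. For example, on 2016-04-06 John gets NaN because he has no data for 2016-04-05 or 2016-04-04. So the result would look like this: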

         Date  User  Value  Value_Average_Past_2_days
0  2016-04-01  Mike    1.0                        NaN
1  2016-04-01  John    2.0                        NaN
2  2016-04-02  Mike    1.0                       1.00
3  2016-04-02  John    3.0                       2.00
4  2016-04-03  Mike    4.5                       1.00
5  2016-04-04  Mike    1.0                       2.75
6  2016-04-05  Mike    2.0                       2.75
7  2016-04-06  Mike    3.0                       1.50
8  2016-04-06  John    6.0                        NaN

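After reading several posts on the forum, it seems this should be some combination of groupby and a custom rolling_mean, but I can't quite figure out how to do it.

【Comments】:

  • Which version of pandas are you using? pd.__version__

Tags: python pandas time-series aggregation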


【Solution 1】:
n = 2

# Cast your dates as timestamps.
test['Date'] = pd.to_datetime(test.Date)

# Create a daily index spanning the range of the original index.
# Naming it 'Date' ensures reset_index() below yields a 'Date' column.
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D', name='Date')

# Pivot by Dates and Users.
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
>>> df.head(3)
User        John  Mike
2016-04-01     2   1.0
2016-04-02     3   1.0
2016-04-03   NaN   4.5

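# Apply a rolling mean on the above dataframe and reset the index.
# shift() moves each value down one day so the window covers only prior days.
# For pandas < 0.18.0 only (pd.rolling_mean was deprecated in 0.18.0 and later removed):
df2 = (pd.rolling_mean(df.shift(), n, min_periods=1)
       .reset_index()
       .drop_duplicates())

# For pandas 0.18.0+ (use this on any current version):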
df2 = (df.shift().rolling(window=n, min_periods=1).mean()
       .reset_index()
       .drop_duplicates())

# Melt the result back into the original form.
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
       .sort_values(['Date', 'User'])
       .reset_index(drop=True))
>>> df3.head()
        Date  User  Value
0 2016-04-01  John    NaN
1 2016-04-01  Mike    NaN
2 2016-04-02  John    2.0
3 2016-04-02  Mike    1.0
4 2016-04-03  John    2.5

# Merge the results back into the original dataframe.
>>> test.merge(df3, on=['Date', 'User'], how='left', 
               suffixes=['', '_Average_past_{0}_days'.format(n)])

        Date  User  Value  Value_Average_past_2_days
0 2016-04-01  Mike    1.0                        NaN
1 2016-04-01  John    2.0                        NaN
2 2016-04-02  Mike    1.0                       1.00
3 2016-04-02  John    3.0                       2.00
4 2016-04-03  Mike    4.5                       1.00
5 2016-04-04  Mike    1.0                       2.75
6 2016-04-05  Mike    2.0                       2.75
7 2016-04-06  Mike    3.0                       1.50
8 2016-04-06  John    6.0                        NaN
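
The `shift()` before the rolling mean is what restricts each window to earlier days. A minimal sketch on Mike's values from the question, assuming one row per calendar day (the names `mike`, `including_today` and `past_only` are illustrative and not part of the answer):

```python
import pandas as pd

# Mike's daily values from the question, one row per calendar day.
mike = pd.Series(
    [1.0, 1.0, 4.5, 1.0, 2.0, 3.0],
    index=pd.date_range("2016-04-01", periods=6, freq="D"),
)

n = 2

# Without shift(): the window ends on the current day, so today's value is included.
including_today = mike.rolling(window=n, min_periods=1).mean()

# With shift(): every value moves down one day, so the window covers only prior days.
past_only = mike.shift().rolling(window=n, min_periods=1).mean()

print(pd.DataFrame({
    "value": mike,
    "including_today": including_today,
    "past_only": past_only,
}))
```

With `min_periods=1`, a window that contains at least one non-missing prior day still produces a mean, while the first day, which has no history at all, stays NaN.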

Summary

n = 2
test['Date'] = pd.to_datetime(test.Date)
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D', name='Date')
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
# pd.rolling_mean was removed from pandas; use .rolling(...).mean() (0.18.0+).
df2 = (df.shift().rolling(window=n, min_periods=1).mean()
       .reset_index()
       .drop_duplicates())
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
       .sort_values(['Date', 'User'])
       .reset_index(drop=True))
test.merge(df3, on=['Date', 'User'], how='left', 
           suffixes=['', '_Average_past_{0}_days'.format(n)])

【Discussion】:

    【Solution 2】:

    I think you can first convert the Date column with to_datetime, then fill in the missing days with groupby and resample, and finally apply rolling:

    test['Date'] = pd.to_datetime(test['Date'])
    
    df = test.groupby('User').apply(lambda x: x.set_index('Date').resample('1D').first())
    print(df)
                     User  Value
    User Date                   
    John 2016-04-01  John    2.0
         2016-04-02  John    3.0
         2016-04-03   NaN    NaN
         2016-04-04   NaN    NaN
         2016-04-05   NaN    NaN
         2016-04-06  John    6.0
    Mike 2016-04-01  Mike    1.0
         2016-04-02  Mike    1.0
         2016-04-03  Mike    4.5
         2016-04-04  Mike    1.0
         2016-04-05  Mike    2.0
         2016-04-06  Mike    3.0
    
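    # The chain is wrapped in parentheses so it can span several lines.
    # group_keys=False keeps the original (User, Date) index on pandas 2.0+.
    df1 = (df.groupby(level=0, group_keys=False)['Value']
             .apply(lambda x: x.shift().rolling(min_periods=1, window=2).mean())
             .reset_index(name='Value_Average_Past_2_days'))
    
    print(df1)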
        User       Date  Value_Average_Past_2_days
    0   John 2016-04-01                        NaN
    1   John 2016-04-02                       2.00
    2   John 2016-04-03                       2.50
    3   John 2016-04-04                       3.00
    4   John 2016-04-05                        NaN
    5   John 2016-04-06                        NaN
    6   Mike 2016-04-01                        NaN
    7   Mike 2016-04-02                       1.00
    8   Mike 2016-04-03                       1.00
    9   Mike 2016-04-04                       2.75
    10  Mike 2016-04-05                       2.75
    11  Mike 2016-04-06                       1.50
    
    print(pd.merge(test, df1, on=['Date', 'User'], how='left'))
            Date  User  Value  Value_Average_Past_2_days
    0 2016-04-01  Mike    1.0                        NaN
    1 2016-04-01  John    2.0                        NaN
    2 2016-04-02  Mike    1.0                       1.00
    3 2016-04-02  John    3.0                       2.00
    4 2016-04-03  Mike    4.5                       1.00
    5 2016-04-04  Mike    1.0                       2.75
    6 2016-04-05  Mike    2.0                       2.75
    7 2016-04-06  Mike    3.0                       1.50
    8 2016-04-06  John    6.0                        NaN
    

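    【Discussion】:

    • Thanks. It works perfectly. Is there a small modification of this code that would let me extract the number of observations in the past n days? Maybe using rolling(min_periods=1,window=n).sum(~is.null()) instead of rolling(min_periods=1,window=2).mean()
    • I think you need rolling(min_periods=1,window=2).count()
    • This is definitely a great answer, but is there a way to do it in fewer steps? I have a similar problem with a year of data, so every step is very time-consuming.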
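
On the follow-up about counting observations: a minimal sketch that counts, per user, how many of the past n days have a value. It fills the missing days with `asfreq('D')` and counts explicitly with `notna()` and a rolling sum rather than `count()`, so windows without any observation come out as 0. The helper `past_n_day_count` and the column name `Obs_Past_2_days` are hypothetical, introduced only for illustration.

```python
import pandas as pd

test = pd.DataFrame({
    "Date": ["2016-04-01", "2016-04-01", "2016-04-02",
             "2016-04-02", "2016-04-03", "2016-04-04",
             "2016-04-05", "2016-04-06", "2016-04-06"],
    "User": ["Mike", "John", "Mike", "John", "Mike", "Mike",
             "Mike", "Mike", "John"],
    "Value": [1, 2, 1, 3, 4.5, 1, 2, 3, 6],
})
test["Date"] = pd.to_datetime(test["Date"])
n = 2


def past_n_day_count(values: pd.Series, n: int) -> pd.Series:
    """Count non-missing values in the n days before each date (hypothetical helper)."""
    daily = values.asfreq("D")                    # insert NaN rows for missing days
    observed = daily.shift().notna().astype(int)  # 1 if the previous day has a value
    return observed.rolling(window=n, min_periods=1).sum().round().astype(int)


# One daily series per user, stacked into a (User, Date) index.
counts = (
    pd.concat({
        user: past_n_day_count(grp.set_index("Date")["Value"], n)
        for user, grp in test.groupby("User")
    })
    .rename_axis(["User", "Date"])
    .rename("Obs_Past_2_days")
    .reset_index()
)

result = test.merge(counts, on=["User", "Date"], how="left")
print(result)
```

For John on 2016-04-06 this gives 0, since neither 2016-04-04 nor 2016-04-05 has an observation.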
    【Solution 3】:

    Use groupby to compute a 30-day / 1-month rolling average

    df_px = df_px.set_index(pd.to_datetime(df_px['date']))
    df_px['px_avg30d']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling('30D').mean())
    

    Reproducible example

    
    import datetime
    
    import pandas as pd
    import pandas_datareader as pddr
    
    df = pddr.DataReader(['CAT','WMT'], 'yahoo', datetime.date(2021,6,30), datetime.date(2022,1,1))
    
    df_px = df['Adj Close'].copy()
    df_px = df_px.resample('W-MON').first()
    df_px = df_px.sample(frac=0.33, random_state=0).sort_index()
    df_px['date']=df_px.index.astype(str).str[:10]
    df_px = df_px.melt(id_vars=['date'])
    df_px.columns = ['date','stock','px']
    df_px = df_px.set_index(pd.to_datetime(df_px['date']))
    df_px['px_avg30d']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling('30D').mean())
    df_px['px_avg3']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling(3,min_periods=1).mean())
    df_px['px_avg4']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling(4,min_periods=1).mean())
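
Applied to the data from the question, the same time-based window idea might look like the sketch below. It assumes a pandas version in which `rolling` accepts an offset window together with `closed='left'`, which makes each window cover [t - n days, t) and therefore excludes the current day; no reindexing to a daily grid is needed. The variable names `past_mean` and `result` are illustrative.

```python
import pandas as pd

test = pd.DataFrame({
    "Date": ["2016-04-01", "2016-04-01", "2016-04-02",
             "2016-04-02", "2016-04-03", "2016-04-04",
             "2016-04-05", "2016-04-06", "2016-04-06"],
    "User": ["Mike", "John", "Mike", "John", "Mike", "Mike",
             "Mike", "Mike", "John"],
    "Value": [1, 2, 1, 3, 4.5, 1, 2, 3, 6],
})
test["Date"] = pd.to_datetime(test["Date"])
n = 2

# Per-user rolling mean over the n calendar days before each row.
# closed="left" drops the right edge of the window, i.e. the current day.
past_mean = (
    test.sort_values("Date")
        .set_index("Date")
        .groupby("User")["Value"]
        .rolling(f"{n}D", closed="left")
        .mean()
        .rename(f"Value_Average_Past_{n}_days")
        .reset_index()
)

# Attach the result to the original rows, keeping their order.
result = test.merge(past_mean, on=["User", "Date"], how="left")
print(result)
```

Note that `rolling('30D')` without `closed='left'`, as used above, includes the current row in its own average.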
    
    

    【Discussion】:
