【Question title】: Pandas - Expanding average session time
【Posted】: 2021-04-10 03:33:09
【Question description】:

The following DataFrame represents events received from users, with a user id and an event timestamp:

    id           timestamp
0    1 2020-09-01 18:14:35
1    1 2020-09-01 18:14:39
2    1 2020-09-01 18:14:40
3    1 2020-09-01 02:09:22
4    1 2020-09-01 02:09:35
5    1 2020-09-01 02:09:53
6    1 2020-09-01 02:09:57
7    2 2020-09-01 18:14:35
8    2 2020-09-01 18:14:39
9    2 2020-09-01 18:14:40
10   2 2020-09-01 02:09:22
11   2 2020-09-01 02:09:35
12   2 2020-09-01 02:09:53
13   2 2020-09-01 02:09:57

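I want to get the expanding average session time. A session is defined as a sequence of events that is separated from the next one by a break of more than 5 minutes.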

I grouped the sessions as follows:

df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])

and got the correct groups:

   id           timestamp
3   1 2020-09-01 02:09:22
4   1 2020-09-01 02:09:35
5   1 2020-09-01 02:09:53
6   1 2020-09-01 02:09:57
   id           timestamp
0   1 2020-09-01 18:14:35
1   1 2020-09-01 18:14:39
2   1 2020-09-01 18:14:40
    id           timestamp
10   2 2020-09-01 02:09:22
11   2 2020-09-01 02:09:35
12   2 2020-09-01 02:09:53
13   2 2020-09-01 02:09:57
   id           timestamp
7   2 2020-09-01 18:14:35
8   2 2020-09-01 18:14:39
9   2 2020-09-01 18:14:40
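
For reference, here is a minimal sketch of how the groups above can be listed. The frame construction is reconstructed from the table in the question, and the names `times`, `groups` and `group_sizes` are only illustrative. Note that `pd.Grouper(freq='5min', origin='start')` cuts the timeline into fixed 5-minute bins anchored at the earliest timestamp. For this data the bins happen to coincide with the sessions, but that is not guaranteed for a gap-based definition in general.

```python
import pandas as pd

# Sample data reconstructed from the question's table (rows kept in the original order).
times = [
    "2020-09-01 18:14:35", "2020-09-01 18:14:39", "2020-09-01 18:14:40",
    "2020-09-01 02:09:22", "2020-09-01 02:09:35", "2020-09-01 02:09:53",
    "2020-09-01 02:09:57",
]
df = pd.DataFrame({"id": [1] * 7 + [2] * 7, "timestamp": pd.to_datetime(times * 2)})

# Group by user and by fixed 5-minute bins that start at the earliest timestamp.
groups = df.groupby(["id", pd.Grouper(key="timestamp", freq="5min", origin="start")])

# Print every group and record its size.
group_sizes = []
for key, group in groups:
    print(group)
    group_sizes.append(len(group))
```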

Now I want to compute, at any given row, the average session time per user (in seconds), so the output would be:

    id           timestamp  avg_session_time
0    1 2020-09-01 18:14:35  0 <-- first event
1    1 2020-09-01 18:14:39  4 <-- 2nd event after 4 seconds
2    1 2020-09-01 18:14:40  5 <-- 3rd event after 5 seconds
--- session end
3    1 2020-09-01 02:09:22  5 <-- first event of second session
4    1 2020-09-01 02:09:35  9 <-- 2nd event after 13 seconds (13 seconds in the 2nd session + 5 in first session divide by the number of sessions 2)
5    1 2020-09-01 02:09:53  18 <-- 3rd event after 31 seconds ((31 + 5) / 2 = 18)
6    1 2020-09-01 02:09:57  20 <-- 4th event after 35 seconds ((35 + 5) / 2 = 20)
---
7    2 2020-09-01 18:14:35  0
8    2 2020-09-01 18:14:39  4
9    2 2020-09-01 18:14:40  5
---
10   2 2020-09-01 02:09:22  5
11   2 2020-09-01 02:09:35  9
12   2 2020-09-01 02:09:53  18
13   2 2020-09-01 02:09:57  20

Any help would be great :)

【Comments】:

    Tags: pandas group-by mean timedelta


    【Solution 1】:

    Use:

    import numpy as np
    import pandas as pd

    # convert the column to datetimes
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    
    # group by id and by 5-minute bins
    g = df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
    # first timestamp of each group, broadcast to a new column
    df['diff'] = g['timestamp'].transform('first')
    # seconds elapsed since the first event of the group
    df['diff'] = df['timestamp'].sub(df['diff']).dt.total_seconds()
    # shift within each id, so every row holds the previous row's elapsed time
    df['new'] = df.groupby('id')['diff'].shift()
    # first non-missing shifted value per group (the previous session's duration)
    df['new'] = g['new'].transform('first')
    # replace 0 with missing values and take the row-wise average
    df['last'] = df[['new','diff']].replace(0, np.nan).mean(axis=1).fillna(df['new'])
    
    print (df)
        id           timestamp  diff  new  last
    0    1 2020-09-01 18:14:35   0.0  0.0   0.0
    1    1 2020-09-01 18:14:39   4.0  0.0   4.0
    2    1 2020-09-01 18:14:40   5.0  0.0   5.0
    3    1 2020-09-01 02:09:22   0.0  5.0   5.0
    4    1 2020-09-01 02:09:35  13.0  5.0   9.0
    5    1 2020-09-01 02:09:53  31.0  5.0  18.0
    6    1 2020-09-01 02:09:57  35.0  5.0  20.0
    7    2 2020-09-01 18:14:35   0.0  0.0   0.0
    8    2 2020-09-01 18:14:39   4.0  0.0   4.0
    9    2 2020-09-01 18:14:40   5.0  0.0   5.0
    10   2 2020-09-01 02:09:22   0.0  5.0   5.0
    11   2 2020-09-01 02:09:35  13.0  5.0   9.0
    12   2 2020-09-01 02:09:53  31.0  5.0  18.0
    13   2 2020-09-01 02:09:57  35.0  5.0  20.0
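
The `last` column above averages the current elapsed time with the duration of the immediately preceding session only, so it appears to cover users with at most two sessions. As a hedged sketch of a more general variant, the code below labels sessions by the gap between consecutive rows instead of fixed bins and averages over all sessions seen so far. The function name `expanding_avg_session_time` and the helper columns `session`, `elapsed` and `prev_total` are hypothetical. The rule that the current session is counted only once its elapsed time is nonzero is an assumption inferred from the expected output in the question.

```python
import pandas as pd


def expanding_avg_session_time(df, gap="5min"):
    """Return a copy of df with a per-row expanding average session time (seconds)."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])

    # A new session starts at a user's first row, or when the absolute gap to the
    # previous row of the same user exceeds `gap`. Rows are taken in the order
    # given, as in the question, so the raw gap may be negative.
    step = out.groupby("id")["timestamp"].diff().abs()
    starts = step.isna() | (step > pd.Timedelta(gap))
    out["session"] = starts.groupby(out["id"]).cumsum()

    # Seconds elapsed since the first event of the row's own session.
    first = out.groupby(["id", "session"])["timestamp"].transform("first")
    out["elapsed"] = (out["timestamp"] - first).dt.total_seconds()

    # Duration of each session (largest elapsed value), then the summed
    # duration of all earlier sessions of the same user.
    duration = out.groupby(["id", "session"])["elapsed"].max()
    prev_total = (duration.groupby(level="id").cumsum() - duration).rename("prev_total")
    out = out.join(prev_total, on=["id", "session"])

    # Assumption: the current session counts only once it has a nonzero length.
    count = (out["session"] - 1) + (out["elapsed"] > 0).astype(int)
    total = out["prev_total"] + out["elapsed"]
    out["avg_session_time"] = (total / count).where(count > 0, 0.0)
    return out


# Sample data reconstructed from the question's table.
times = [
    "2020-09-01 18:14:35", "2020-09-01 18:14:39", "2020-09-01 18:14:40",
    "2020-09-01 02:09:22", "2020-09-01 02:09:35", "2020-09-01 02:09:53",
    "2020-09-01 02:09:57",
]
events = pd.DataFrame({"id": [1] * 7 + [2] * 7, "timestamp": times * 2})
result = expanding_avg_session_time(events)
print(result[["id", "timestamp", "avg_session_time"]])
```

On the sample data this reproduces the `avg_session_time` column requested in the question. Whether the counting rule matches the intent for users with three or more sessions would need to be confirmed against real data.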
    

    【Comments】:
