【Question title】: Filtering a pandas DataFrame based on value and time
【Posted】: 2016-04-08 22:03:57
【Question】:

I have a pandas DataFrame like this:

2011-5-5 12:43               noEvent          CarA      otherColumns...
2011-5-5 12:45               noEvent          CarA          ...
2011-5-5 12:49               EVENT            CarA          ...
2011-5-5 12:51               noEvent          CarA          ...
(no data - jumps in time)
2011-5-6 12:52               EVENT            CarA          ...
2011-5-6 12:59               noEvent          CarA          ...
2011-5-6 13:00               noEvent          CarA          ...
2011-5-5 12:43               noEvent          CarB          ...
2011-5-5 12:45               noEvent          CarB          ...
2011-5-5 12:49               noEvent          CarB          ...
2011-5-5 12:51               noEvent          CarB          ...
(no data - jumps in time)
2011-5-6 12:52               noEvent          CarB          ...
2011-5-6 12:52               EVENT            CarB          ...
2011-5-6 13:00               noEvent          CarB          ...

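Explanation:

  • The timestamp column is not evenly spaced
  • There are 2 cars, A and B. Events from A are independent of events from B

I need to run some calculations for each car over the window of +/- 2 minutes around each event.

I am stuck on how to do this... how can I filter this DataFrame?

The desired result should look like this: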

-2min
2011-5-5 12:49               EVENT            CarA          ...
+2min

-2min
2011-5-6 12:52               EVENT            CarA          ...
+2min

-2min
2011-5-6 12:52               EVENT            CarB          ...
+2min

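Some information:

  • Events from CarA and CarB must not be mixed
  • In the future the number of cars may reach hundreds of thousands

I don't know where to start...

  • Which functions can I use?
  • How can I group the events into "blocks" so that each 4-minute block of records can be processed separately?

【Comments】:

    Tags: python pandas filter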


    【Solution 1】:

    First group by the Car column, then process each group as follows.

    Create test data first:

    import pandas as pd
    import numpy as np
    
    np.random.seed(1)
    idx = pd.date_range("2016-03-01 10:00:00", "2016-03-01 20:00:00", freq="s")
    idx = idx[np.random.randint(0, len(idx), 10000)].sort_values()
    evt = np.array(["no event", "event"])[(np.random.rand(len(idx)) < 0.0005).astype(int)]
    df = pd.DataFrame({"event":evt, "value":np.random.randint(0, 10, len(evt))}, index=idx)
    

    Find the event rows and the positional indices of the rows within +/- 10 seconds of each event:

    event_time = df.index[df.event == "event"]
    delta = pd.Timedelta(10, unit="s")
    
    start_idx = df.index.searchsorted(event_time - delta).tolist()
    end_idx = df.index.searchsorted(event_time + delta).tolist()
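
To see what `searchsorted` returns in this step, here is a tiny illustration with made-up timestamps (the values are not from the answer). On a sorted index it gives, for each boundary time, the position of the first row at or after that time, so `idx[start:end]` is the window around the event:

```python
import pandas as pd

# Four hypothetical, already sorted timestamps.
idx = pd.DatetimeIndex([
    "2016-03-01 10:00:00",
    "2016-03-01 10:00:05",
    "2016-03-01 10:00:12",   # pretend this row is the event
    "2016-03-01 10:00:30",
])
event_time = pd.DatetimeIndex(["2016-03-01 10:00:12"])
delta = pd.Timedelta(10, unit="s")

# Position of the first row at or after 10:00:02.
start = idx.searchsorted(event_time - delta)
# Position of the first row at or after 10:00:22.
end = idx.searchsorted(event_time + delta)

# Slice of rows that fall inside the window around the event.
window = idx[start[0]:end[0]]
print(start, end, list(window))
```

With the default `side="left"`, a row lying exactly at `event_time + delta` would be excluded from the slice.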
    

    Create the mask arrays:

    mask = np.zeros(df.shape[0], dtype=bool)
    evt_id = np.zeros(df.shape[0], dtype=int)
    for i, (s, e) in enumerate(zip(start_idx, end_idx)):
        mask[s:e] = True
        evt_id[s:e] = i
    

    Filter the rows with the mask array. Here I also create an event_id column to group the rows by event:

    df_event = df[mask]
    df_event["event_id"] = evt_id[mask]
    

    Output:

                            event  value  event_id
    2016-03-01 13:51:48  no event      0         0
    2016-03-01 13:51:51     event      8         0
    2016-03-01 13:51:53  no event      3         0
    2016-03-01 13:52:00  no event      1         0
    2016-03-01 14:21:00  no event      2         1
    2016-03-01 14:21:00  no event      5         1
    2016-03-01 14:21:00  no event      0         1
    2016-03-01 14:21:02  no event      1         1
    2016-03-01 14:21:04  no event      2         1
    2016-03-01 14:21:06  no event      0         1
    2016-03-01 14:21:07     event      1         1
    2016-03-01 14:21:16  no event      1         1
    2016-03-01 14:21:16  no event      9         1
    2016-03-01 15:09:42  no event      1         2
    2016-03-01 15:09:49     event      7         2
    2016-03-01 15:09:54  no event      3         2
    2016-03-01 15:09:55  no event      3         2
    2016-03-01 15:09:58  no event      5         2
    2016-03-01 15:09:58  no event      9         2
    2016-03-01 17:36:44  no event      8         3
    2016-03-01 17:36:44  no event      2         3
    2016-03-01 17:36:44  no event      9         3
    2016-03-01 17:36:45  no event      2         3
    2016-03-01 17:36:49     event      9         3
    2016-03-01 17:36:50  no event      6         3
    2016-03-01 17:36:54  no event      1         3
    2016-03-01 17:36:56  no event      1         3
    2016-03-01 18:51:37  no event      5         4
    2016-03-01 18:51:37  no event      3         4
    2016-03-01 18:51:42  no event      0         4
    2016-03-01 18:51:47     event      9         4
    2016-03-01 18:51:55  no event      4         4
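
The answer says to group by the Car column first but shows only the per-group step. A minimal sketch of the grouping follows, reusing the question's sample rows with a 2-minute window. The function name `rows_near_events`, the column names `event` and `car`, and the use of `side="right"` for the upper bound (so that rows exactly `delta` after an event are kept) are assumptions made for illustration, not part of the answer:

```python
import numpy as np
import pandas as pd

def rows_near_events(group, delta):
    """Return the rows of one car that lie within +/- delta of any of its events."""
    group = group.sort_index(kind="mergesort")   # searchsorted needs a sorted index
    event_time = group.index[group["event"] == "EVENT"]
    start = group.index.searchsorted(event_time - delta, side="left")
    end = group.index.searchsorted(event_time + delta, side="right")
    mask = np.zeros(len(group), dtype=bool)
    evt_id = np.zeros(len(group), dtype=int)
    for i, (s, e) in enumerate(zip(start, end)):
        mask[s:e] = True
        evt_id[s:e] = i          # if windows overlap, the later event wins
    out = group[mask].copy()     # explicit copy, safe to add a column
    out["event_id"] = evt_id[mask]
    return out

# Sample rows from the question, with the timestamp as the index.
times = pd.to_datetime([
    "2011-05-05 12:43", "2011-05-05 12:45", "2011-05-05 12:49",
    "2011-05-05 12:51", "2011-05-06 12:52", "2011-05-06 12:59",
    "2011-05-06 13:00", "2011-05-05 12:43", "2011-05-05 12:45",
    "2011-05-05 12:49", "2011-05-05 12:51", "2011-05-06 12:52",
    "2011-05-06 12:52", "2011-05-06 13:00"])
df = pd.DataFrame({
    "event": ["noEvent", "noEvent", "EVENT", "noEvent", "EVENT", "noEvent",
              "noEvent", "noEvent", "noEvent", "noEvent", "noEvent",
              "noEvent", "EVENT", "noEvent"],
    "car": ["CarA"] * 7 + ["CarB"] * 7,
}, index=times)

delta = pd.Timedelta(minutes=2)

# Process each car on its own, so events from different cars never mix.
parts = [rows_near_events(g, delta) for _, g in df.groupby("car")]
result = pd.concat(parts)
print(result)
```

Here `event_id` restarts at 0 for every car, so a later per-event calculation would group on both columns, for example `result.groupby(["car", "event_id"])`.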
    

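    【Comments】:

    • Hi @hyry, thanks for your code. Could you also share code showing how to group by the "car" column and process each group independently?
    • PS: I am still trying to understand each block of code you provided.. :-)
    • Hey @hyry, me again... the code is really clever. Well done. The only drawback is the message below: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/… df_event["event_id"] = evt_id[mask]
    • @guilhermecgs, try this: df_event.loc[:, "event_id"] = evt_id[mask]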
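
The warning quoted above appears because `df[mask]` is a subset of `df`, and pandas cannot tell whether the later assignment is meant to reach the original frame. One common way to avoid it is to take an explicit copy before adding the column. This is offered as a sketch rather than as the answerer's fix, and the three-row frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Small stand-in for the answer's df, mask and evt_id (hypothetical values).
df = pd.DataFrame({"event": ["no event", "event", "no event"],
                   "value": [1, 2, 3]})
mask = np.array([True, True, False])
evt_id = np.array([0, 0, 0])

# .copy() makes df_event an independent DataFrame,
# so adding a column to it is unambiguous.
df_event = df[mask].copy()
df_event["event_id"] = evt_id[mask]
print(df_event)
```

The original `df` is left untouched; only `df_event` gains the `event_id` column.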
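    【Solution 2】:

    Consider a cross join merge of the DataFrame filtered to events against the full DataFrame, then keep only the records that fall within +/- 2 minutes of an event for the same car:

    DataFrame setup (the sample data from the post)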

    import pandas as pd
    import datetime
    
    df = pd.DataFrame({'Date': ['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
                                '5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59', 
                                '5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45', 
                                '5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
                                '5/6/2011 12:52', '5/6/2011 13:00'],
                       'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent','EVENT',
                                 'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
                                 'noEvent', 'noEvent', 'EVENT', 'noEvent'],
                       'Car': ['CarA', 'CarA', 'CarA', 'CarA', 'CarA',
                               'CarA', 'CarA', 'CarB', 'CarB','CarB',
                               'CarB', 'CarB', 'CarB', 'CarB']})
    
    df['Date'] = pd.to_datetime(df['Date'])
    
    #      Car                Date    Event
    # 0   CarA 2011-05-05 12:43:00  noEvent
    # 1   CarA 2011-05-05 12:45:00  noEvent
    # 2   CarA 2011-05-05 12:49:00    EVENT
    # 3   CarA 2011-05-05 12:51:00  noEvent
    # 4   CarA 2011-05-06 12:52:00    EVENT
    # 5   CarA 2011-05-06 12:59:00  noEvent
    # 6   CarA 2011-05-06 13:00:00  noEvent
    # 7   CarB 2011-05-05 12:43:00  noEvent
    # 8   CarB 2011-05-05 12:45:00  noEvent
    # 9   CarB 2011-05-05 12:49:00  noEvent
    # 10  CarB 2011-05-05 12:51:00  noEvent
    # 11  CarB 2011-05-06 12:52:00  noEvent
    # 12  CarB 2011-05-06 12:52:00    EVENT
    # 13  CarB 2011-05-06 13:00:00  noEvent
    

    Cross join (returns the full set of M x N pairings between the two frames)

    df['key'] = 1
    
    # EVENTS DF
    eventsdf = df[df['Event']=='EVENT']
    
    # CROSS JOIN DF
    crossdf = pd.merge(df, eventsdf, on='key')
    
    crossdf = crossdf[((crossdf['Date_x'] <= crossdf['Date_y'] 
                                   + datetime.timedelta(minutes=2)) &
                       (crossdf['Date_x'] >= crossdf['Date_y'] 
                                   - datetime.timedelta(minutes=2))) &
                       (crossdf['Car_x'] == crossdf['Car_y'])].sort_values('Date_x')
    
    finaldf = crossdf[['Car_x', 'Date_x', 'Event_x']].drop_duplicates().sort_values('Car_x')                      
    finaldf.columns = ['Car', 'Date', 'Event']
    
    #      Car                Date    Event
    # 6   CarA 2011-05-05 12:49:00    EVENT
    # 9   CarA 2011-05-05 12:51:00  noEvent
    # 13  CarA 2011-05-06 12:52:00    EVENT
    # 35  CarB 2011-05-06 12:52:00  noEvent
    # 38  CarB 2011-05-06 12:52:00    EVENT
    

    【Comments】:

    • Thanks for your code. However, I am really wary of cross joins because the DataFrame is large
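
For the size concern raised above, one possible alternative that avoids building the M x N frame is `pd.merge_asof`, which matches each row to the nearest event of the same car within a tolerance. This is a different technique from the answer's cross join, sketched here on the question's sample data; the helper column `event_time` is a name introduced for illustration. Note that each row is matched to a single nearest event only, which is enough for filtering but does not assign a row to several overlapping events:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2011-05-05 12:43", "2011-05-05 12:45", "2011-05-05 12:49",
        "2011-05-05 12:51", "2011-05-06 12:52", "2011-05-06 12:59",
        "2011-05-06 13:00", "2011-05-05 12:43", "2011-05-05 12:45",
        "2011-05-05 12:49", "2011-05-05 12:51", "2011-05-06 12:52",
        "2011-05-06 12:52", "2011-05-06 13:00"]),
    "Event": ["noEvent", "noEvent", "EVENT", "noEvent", "EVENT", "noEvent",
              "noEvent", "noEvent", "noEvent", "noEvent", "noEvent",
              "noEvent", "EVENT", "noEvent"],
    "Car": ["CarA"] * 7 + ["CarB"] * 7,
})

# merge_asof requires both frames to be sorted by the "on" column.
left = df.sort_values("Date", kind="mergesort")

# Events only, with the event timestamp duplicated into a helper column
# (event_time is a hypothetical name) so it survives the merge.
events = (df.loc[df["Event"] == "EVENT", ["Car", "Date"]]
            .assign(event_time=lambda d: d["Date"])
            .sort_values("Date", kind="mergesort"))

# For each row, look up the nearest event of the same car within 2 minutes.
nearest = pd.merge_asof(left, events, on="Date", by="Car",
                        direction="nearest",
                        tolerance=pd.Timedelta(minutes=2))

# Rows without a nearby event have a missing event_time.
near_events = nearest.dropna(subset=["event_time"])
print(near_events.sort_values(["Car", "Date"]))
```

Grouping `near_events` by `["Car", "event_time"]` would then give one block of rows per event and car.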