【Question Title】: How to efficiently expand a daterange pandas dataframe with groups
【Posted】: 2019-09-30 03:00:23
【Question】:

I have a large dataset with several groups, containing start and end date columns plus a value column (each group can have several values). I would like to expand it efficiently into a new dataframe whose index is time (in seconds) and which has one column per group, holding that group's values.

The data looks like this:

import pandas as pd
import datetime as dt
import numpy as np

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3,5,22,21), dt.datetime(2017, 4, 5,3,51,22),\
               dt.datetime(2017, 4, 4,4,23,33),dt.datetime(2017, 4, 3,7,28,45),\
               dt.datetime(2017, 4, 6,5,22,24),dt.datetime(2017, 4, 6,5,22,56)]

df['end'] = [dt.datetime(2017, 4, 3,6,33,23), dt.datetime(2017, 4,5,3,52,46),\
             dt.datetime(2017, 4,4,4,58,12),dt.datetime(2017, 4, 4,1,23,34),\
            dt.datetime(2017, 4, 7,5,22,24),dt.datetime(2017, 4, 7,5,22,47)]
df['group'] = ['1', '2', '3','1','2','3']
df['value'] = ['a', 'b', 'c','b','c','a']

                start                 end group value
0 2017-04-03 05:22:21 2017-04-03 06:33:23     1     a
1 2017-04-05 03:51:22 2017-04-05 03:52:46     2     b
2 2017-04-04 04:23:33 2017-04-04 04:58:12     3     c
3 2017-04-03 07:28:45 2017-04-04 01:23:34     1     b
4 2017-04-06 05:22:24 2017-04-07 05:22:24     2     c
5 2017-04-06 05:22:56 2017-04-07 05:22:47     3     a

I tried the following approach:

  1. Build a new dataframe whose index spans from the earliest start to the latest end.

  2. Group by group ID.

  3. Iterate over each group's rows; from each row, build a small dataframe indexed from the row's start date to the row's end date, holding the row's value.

  4. Concatenate the small dataframes of the same group into one dataframe.

  5. Left-join each group's dataframe (really one column of values over a date index) onto the big dataframe, adding it as a column.

Here is the snippet:


def turn_deltas(row, col):
    # build one small dataframe per row, indexed second by second over [start, end]
    key = str(row['group'])
    df = pd.DataFrame(index=pd.date_range(row['start'], row['end'], freq="1s"))
    df[key] = row[col]
    return df

grouped = df.groupby("group")
data = pd.DataFrame(index=pd.date_range(df['start'].min(), df['end'].max(), freq="1s"))
for name, group in grouped:
    for i, row in enumerate(group.iterrows()):
        if i == 0:
            df_2 = turn_deltas(row[1], "value")
        else:
            df_2 = pd.concat([df_2, turn_deltas(row[1], "value")], axis=0)
    data = data.merge(df_2, how="left", left_index=True, right_index=True)

print(data)

My code works, but it is very (very) slow at this task.

In the end, I get this expanded dataframe:

2017-04-03 05:22:21    a  NaN  NaN
2017-04-03 05:22:22    a  NaN  NaN
2017-04-03 05:22:23    a  NaN  NaN
2017-04-03 05:22:24    a  NaN  NaN
2017-04-03 05:22:25    a  NaN  NaN
2017-04-03 05:22:26    a  NaN  NaN
2017-04-03 05:22:27    a  NaN  NaN
2017-04-03 05:22:28    a  NaN  NaN
2017-04-03 05:22:29    a  NaN  NaN
2017-04-03 05:22:30    a  NaN  NaN
2017-04-03 05:22:31    a  NaN  NaN
2017-04-03 05:22:32    a  NaN  NaN
2017-04-03 05:22:33    a  NaN  NaN
2017-04-03 05:22:34    a  NaN  NaN
2017-04-03 05:22:35    a  NaN  NaN
2017-04-03 05:22:36    a  NaN  NaN
2017-04-03 05:22:37    a  NaN  NaN
2017-04-03 05:22:38    a  NaN  NaN
2017-04-03 05:22:39    a  NaN  NaN
2017-04-03 05:22:40    a  NaN  NaN
2017-04-03 05:22:41    a  NaN  NaN
2017-04-03 05:22:42    a  NaN  NaN
2017-04-03 05:22:43    a  NaN  NaN
2017-04-03 05:22:44    a  NaN  NaN
2017-04-03 05:22:45    a  NaN  NaN
2017-04-03 05:22:46    a  NaN  NaN
2017-04-03 05:22:47    a  NaN  NaN
2017-04-03 05:22:48    a  NaN  NaN
2017-04-03 05:22:49    a  NaN  NaN
2017-04-03 05:22:50    a  NaN  NaN
...                  ...  ...  ...
2017-04-07 05:22:18  NaN    c    a
2017-04-07 05:22:19  NaN    c    a
2017-04-07 05:22:20  NaN    c    a
2017-04-07 05:22:21  NaN    c    a
2017-04-07 05:22:22  NaN    c    a
2017-04-07 05:22:23  NaN    c    a
2017-04-07 05:22:24  NaN    c    a
2017-04-07 05:22:25  NaN  NaN    a
2017-04-07 05:22:26  NaN  NaN    a
2017-04-07 05:22:27  NaN  NaN    a
2017-04-07 05:22:28  NaN  NaN    a
2017-04-07 05:22:29  NaN  NaN    a
2017-04-07 05:22:30  NaN  NaN    a
2017-04-07 05:22:31  NaN  NaN    a
2017-04-07 05:22:32  NaN  NaN    a
2017-04-07 05:22:33  NaN  NaN    a
2017-04-07 05:22:34  NaN  NaN    a
2017-04-07 05:22:35  NaN  NaN    a
2017-04-07 05:22:36  NaN  NaN    a
2017-04-07 05:22:37  NaN  NaN    a
2017-04-07 05:22:38  NaN  NaN    a
2017-04-07 05:22:39  NaN  NaN    a
2017-04-07 05:22:40  NaN  NaN    a
2017-04-07 05:22:41  NaN  NaN    a
2017-04-07 05:22:42  NaN  NaN    a
2017-04-07 05:22:43  NaN  NaN    a
2017-04-07 05:22:44  NaN  NaN    a
2017-04-07 05:22:45  NaN  NaN    a
2017-04-07 05:22:46  NaN  NaN    a
2017-04-07 05:22:47  NaN  NaN    a

Note: this code is only one part of a larger project. After this transformation I also use get_dummies() to get a separate column for each value of each column, so feel free to take that into account in your implementation strategy.
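As a minimal sketch of the get_dummies() step mentioned above (the tiny expanded frame here is hypothetical, just mirroring the structure of the final data frame):

```python
import pandas as pd

# Hypothetical miniature of the expanded frame: a seconds index and
# one column per group holding that group's value (or NaN).
idx = pd.date_range("2017-04-03 05:22:21", periods=3, freq="1s")
expanded = pd.DataFrame({"1": ["a", "a", None],
                         "2": [None, "b", "b"]}, index=idx)

# One indicator column per (group, value) pair; NaN cells yield all-zero rows.
dummies = pd.get_dummies(expanded)
print(dummies.columns.tolist())  # ['1_a', '2_b']
```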

Thanks!

【Question Discussion】:

    Tags: python pandas date pandas-groupby processing-efficiency


    【Solution 1】:

    I would use merge_ordered to build, for each group, a dataframe indexed by the index of your data dataframe. It will contain unwanted values, so those need to be cleaned up. But from there, building your final dataframe is easy:

    for g, dg in df.groupby('group'):
        # build a dataframe per group with the final index
        dy = pd.merge_ordered(data.rename_axis('dat').reset_index(), dg,
             left_on='dat', right_on='start', fill_method='ffill')
        # clean values outside of [start:end] range
        dy.loc[(dy.start>dy.dat)|(dy.dat>dy.end), 'group'] = np.nan
        dy.loc[(dy.start>dy.dat)|(dy.dat>dy.end), 'value'] = np.nan
        # and use that to set the column in the final dataframe
        data[g] = dy.set_index('dat').value
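As a minimal, self-contained illustration of the merge_ordered / fill_method='ffill' behaviour this loop relies on (toy timestamps, not the question's data):

```python
import pandas as pd

# Toy example: every left timestamp picks up the last right row at or
# before it, via forward fill; timestamps before the first event stay NaN.
left = pd.DataFrame({"dat": pd.to_datetime(
    ["2017-04-03 00:00", "2017-04-03 01:00", "2017-04-03 02:00"])})
right = pd.DataFrame({"start": pd.to_datetime(["2017-04-03 01:00"]),
                      "value": ["a"]})
out = pd.merge_ordered(left, right, left_on="dat", right_on="start",
                       fill_method="ffill")
print(out["value"].tolist())  # [nan, 'a', 'a']
```

This also shows why the answer cleans up afterwards: the forward fill keeps propagating 'a' past the event, so values outside the [start:end] range must be reset to NaN.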
    

    If performance really matters, using the indexes properly makes a difference. This version should be about 3x faster:

    for g, dg in df.groupby('group'):
        # build a dataframe per group with the final index
        dy = pd.merge_asof(data, dg.set_index('start'),
                     left_index=True, right_index=True)
        # clean values outside of [start:end] range
        dy.loc[dy.index>dy.end,'value'] = np.nan
        # and use that to set the column in the final dataframe
        data[g] = dy.value
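And a minimal illustration of the merge_asof semantics used here (toy data, not the question's frames): each left timestamp receives the most recent right row whose key is not after it.

```python
import pandas as pd

# merge_asof: for each left index, take the last right row with key <= index.
left = pd.DataFrame({"t": range(4)},
                    index=pd.date_range("2017-04-03", periods=4, freq="1h"))
right = pd.DataFrame({"value": ["a", "b"]},
                     index=pd.to_datetime(["2017-04-03 00:30:00",
                                           "2017-04-03 02:30:00"]))
out = pd.merge_asof(left, right, left_index=True, right_index=True)
print(out["value"].tolist())  # [nan, 'a', 'a', 'b']
```

Because the match only looks backward from start, the single cleanup step (blanking rows where the index is past end) is enough.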
    

    【Discussion】:

    • Hi, thanks for your answer. I noticed two things: 1. I ran your code but the result is quite different from the desired one — at the end you can see that in your result every column has some value, but that shouldn't happen (for example column 1 has the value 'b' on 2017-04-07, while its last event ends on 2017-04-04).
    • @AviEini: I had a typo in the dy.loc[..., 'value'] = np.nan line: I wrote & where it should be |. I have edited my post; hopefully the result is better now...
    • Thanks for the fix. It works well! 2. About efficiency: I tried it in my Jupyter notebook and used %%timeit to compare — my code takes 203 ms while yours takes 880 ms. I'd be glad if you could take another look. Thanks!
    • @AviEini: I can improve my version a bit; it should then match your performance on a small dataframe. On larger ones my version should be faster, since it needs no iterrows at all and is fully vectorized.
    【Solution 2】:

    First of all, you should really convert the values to some dtype other than object, i.e. use 0, 1, 2 instead of 'a', 'b', 'c'.
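That dtype suggestion could be sketched like this (a hypothetical mapping via pandas categoricals, not part of the original answer):

```python
import pandas as pd

# Map string labels to compact integer codes before expanding.
s = pd.Series(['a', 'b', 'c', 'b', 'c', 'a'])
cat = s.astype('category')
codes = cat.cat.codes            # int8 codes: a -> 0, b -> 1, c -> 2
print(codes.tolist())            # [0, 1, 2, 1, 2, 0]
# cat.cat.categories keeps the mapping back to the original labels
```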

    As for the conversion code, this seems to be very fast, at least on your example df. It is also very short and readable:

    data = pd.DataFrame(index=pd.date_range(df['start'].min(), df['end'].max(), freq="1S"))
    
    for i,row in df.iterrows():
        data.loc[(data.index >= row['start'])&(data.index<=row['end']),
                 row['group']] = row['value']    
    

    【Discussion】:
