【Question title】: Sum a column based on groupby and condition
【Posted】: 2020-06-17 19:51:22
【Question】:

I have a DataFrame with several columns. I want to sum the gap column over the rows whose time falls within certain time slots.

    region  date        time      gap
0   1   2016-01-01  00:00:08    1
1   1   2016-01-01  00:00:48    0
2   1   2016-01-01  00:02:50    1
3   1   2016-01-01  00:00:52    0
4   1   2016-01-01  00:10:01    0
5   1   2016-01-01  00:10:03    1
6   1   2016-01-01  00:10:05    0
7   1   2016-01-01  00:10:08    0

I want to sum the gap column. I have a dict of time slots like this:

'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'

After summing, the DataFrame above should look like this:

    region  date        time            gap
0   1   2016-01-01  00:10:00/slot1  2
1   1   2016-01-01  00:20:00/slot2  1

I have many regions and 144 time slots, from 00:00:00 to 23:59:49. I tried this:

regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()

But it does not work.

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    The idea is to convert the time column to datetimes, floor them by 10Min, and then convert them back to HH:MM:SS strings:

    d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
    d1 = {v:k for k, v in d.items()}
    
    df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
    print (df)
       region        date      time  gap
    0       1  2016-01-01  00:00:00    1
    1       1  2016-01-01  00:00:00    0
    2       1  2016-01-01  00:00:00    1
    3       1  2016-01-01  00:00:00    0
    4       1  2016-01-01  00:10:00    0
    5       1  2016-01-01  00:10:00    1
    6       1  2016-01-01  00:10:00    0
    7       1  2016-01-01  00:10:00    0
    

    Aggregate with sum, then map the slot names onto the result using the dictionary with keys and values swapped:

    regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
    regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
    print (regres)
       region        date            time  gap
    0       1  2016-01-01  00:00:00/slot1    2
    1       1  2016-01-01  00:10:00/slot2    1
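
For the full day mentioned in the question (144 ten-minute slots), the slot dictionary does not have to be typed by hand. The following is a minimal sketch rather than part of the original answer. It assumes the slots are numbered slot1 to slot144 in chronological order, and the names slot_starts and start_to_slot are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'region': [1] * 8,
    'date': ['2016-01-01'] * 8,
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
})

# Start times of all 144 ten-minute slots of one day, as HH:MM:SS strings.
# The date used here is arbitrary; only the time of day is kept.
slot_starts = pd.date_range('2016-01-01', periods=144, freq='10min').strftime('%H:%M:%S')

# Assumed naming scheme: slot1 starts at 00:00:00, slot144 at 23:50:00.
start_to_slot = {start: f'slot{i}' for i, start in enumerate(slot_starts, start=1)}

# Floor each time to the start of its slot, then aggregate and label.
floored = pd.to_datetime(df['time'], format='%H:%M:%S').dt.floor('10min')
out = (df.assign(time=floored.dt.strftime('%H:%M:%S'))
         .groupby(['region', 'date', 'time'], as_index=False)['gap'].sum())
out['time'] = out['time'] + '/' + out['time'].map(start_to_slot)
print(out)
```

Passing an explicit format to pd.to_datetime avoids format inference on the time strings, which can help on large frames.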
    

    If you want to display the next 10Min slot instead:

    d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
    d1 = {v:k for k, v in d.items()}
    
    times = pd.to_datetime(df['time']).dt.floor('10Min')
    df['time'] = times.dt.strftime('%H:%M:%S')
    df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
    print (df)
       region        date      time  gap     time1
    0       1  2016-01-01  00:00:00    1  00:10:00
    1       1  2016-01-01  00:00:00    0  00:10:00
    2       1  2016-01-01  00:00:00    1  00:10:00
    3       1  2016-01-01  00:00:00    0  00:10:00
    4       1  2016-01-01  00:10:00    0  00:20:00
    5       1  2016-01-01  00:10:00    1  00:20:00
    6       1  2016-01-01  00:10:00    0  00:20:00
    7       1  2016-01-01  00:10:00    0  00:20:00
    
    regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
    regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
    print (regres)
       region        date            time  gap
    0       1  2016-01-01  00:10:00/slot1    2
    1       1  2016-01-01  00:20:00/slot2    1
    

    EDIT:

    An improvement over flooring and converting to strings is binning with cut or searchsorted:

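    import numpy as np
    import pandas as pd
    
    df['time'] = pd.to_timedelta(df['time'])
    
    bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
    # label each bin with its start time (HH:MM:SS); the last edge starts no bin
    labels = np.array(['{}'.format(str(x)[-8:]) for x in bins])
    labels = labels[:-1]
    
    # right=False / side='right' make the bins left-closed, so 00:10:00 falls
    # into the 00:10:00 slot, matching what floor does
    df['time1'] = pd.cut(df['time'], bins=bins, labels=labels, right=False)
    df['time11'] = labels[np.searchsorted(bins, df['time'].values, side='right') - 1]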
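
A self-contained sketch of the binning idea on a few hand-picked times, including values that sit exactly on a slot boundary. It assumes left-closed bins (right=False for cut, side='right' for searchsorted), so that 00:10:00 belongs to the slot starting at 00:10:00. The sample times and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

times = pd.to_timedelta(pd.Series(['00:00:00', '00:09:59', '00:10:00', '23:59:59']))

# 145 edges: 00:00:00, 00:10:00, ..., 24:00:00
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10min')

# Label each bin with its start time; the final edge (24:00:00) starts no bin.
labels = np.array([str(b)[-8:] for b in bins[:-1]])

# Variant 1: pd.cut with left-closed intervals [start, end).
via_cut = pd.cut(times, bins=bins, labels=labels, right=False)

# Variant 2: searchsorted returns the insertion point to the right of equal
# edges, so subtracting 1 gives the index of the bin the value belongs to.
positions = np.searchsorted(bins.values, times.values, side='right') - 1
via_searchsorted = labels[positions]

print(via_cut.astype(str).tolist())
print(via_searchsorted.tolist())
```

Both variants should agree with each other; the searchsorted variant works on plain NumPy arrays and skips the categorical machinery of cut.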
    

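    【Comments】:

    • Is the process slow? It takes too much time on data with more than 600,000 records.
    • @shahidhamdam - I need some more time, but a faster version should be possible.
    • @shahidhamdam - One thing: do you need the second solution or the first?
    • I need the first solution, but one more thing: I also want to count the number of rows in each time slot. For example, slot1 has 4 rows. Can you help me?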
    • @shahidhamdam - Use regres = df.groupby(['region','date','time','time1']).size().reset_index(name='count')
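
The sum and the row count per slot can also be computed in a single groupby with named aggregation. This is a sketch, not taken from the thread. It assumes the time column has already been floored to slot starts as in the first solution, and the output column names gap_sum and count are illustrative.

```python
import pandas as pd

# Times are assumed to be floored to their 10-minute slot start already.
df = pd.DataFrame({
    'region': [1] * 8,
    'date': ['2016-01-01'] * 8,
    'time': ['00:00:00'] * 4 + ['00:10:00'] * 4,
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
})

# One pass over the groups: sum of gap and number of rows per slot.
stats = (df.groupby(['region', 'date', 'time'], as_index=False)
           .agg(gap_sum=('gap', 'sum'), count=('gap', 'size')))
print(stats)
```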
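    【Solution 2】:

    Just to avoid the complexity of datetime comparisons (unless that is the whole point of your question, in which case ignore my answer), and to show the essence of this group-by-slot-window problem, I assume here that the times are integers.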

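    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({'time':[8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056], 
                       'gap': [1, 0,  1,   0,  0,    1,    0,    0,    1,    1,    1]})
    slots = np.array([0, 1000, 1500])  # slot start values, sorted ascending
    # for each row, pick the largest slot start that is strictly below the time
    df['slot'] = df.apply(lambda x: slots[np.argmax(slots[x['time']>slots])], axis=1)
    df.groupby('slot')[['gap']].sum()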
    

    Output

           gap
    slot    
    -----------
    0       2
    1000    1
    1500    3
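
The row-wise apply above can be replaced by a vectorized lookup, which matters for frames with hundreds of thousands of rows. The following is a sketch using np.searchsorted. It assumes slots is sorted ascending and that a time equal to a slot start belongs to that slot, which differs slightly from the strict > in the lambda above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap':  [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])  # slot start values, sorted ascending

# For each time, find the index of the last slot start that is <= time.
# Times below slots[0] would yield -1 and need separate handling.
idx = np.searchsorted(slots, df['time'].to_numpy(), side='right') - 1
df['slot'] = slots[idx]

result = df.groupby('slot')[['gap']].sum()
print(result)
```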
    

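    【Comments】:

      【Solution 3】:

      One way to think about this problem is to first convert your time column to the values you want, and then do a groupby sum on the time column.

      The code below shows the approach I used. I use np.select so that as many conditions and choices as needed can be included. Once time has been converted to the values I want, I do a simple groupby sum. There is really no need to fuss with formatting times or converting strings; just let pandas handle it.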

      import pandas as pd
      import numpy as np  # needed for np.select
      
      # Create the DataFrame from a dictionary
      regdict = {
              'time': ['00:00:08','00:00:48','00:02:50','00:00:52','00:10:01','00:10:03','00:10:05','00:10:08'],
              'gap': [1,0,1,0,0,1,0,0],}
      
      df = pd.DataFrame(regdict)
      
      # Add all your conditions and choices here; plain string comparison works
      # because zero-padded HH:MM:SS strings sort chronologically
      condlist = [df['time']<'00:10:00',df['time']<'00:20:00'] 
      choicelist = ['00:10:00/slot1','00:20:00/slot2'] 
      
      # Use np.select once all conditions and choices are defined;
      # a string default keeps the result dtype consistent for unmatched rows
      answerlist = np.select(condlist, choicelist, default='unknown')
      print (answerlist)
      ['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
      '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
      
      #Assign answerlist to df['time']
      df['time'] = answerlist
      print (df)
                   time  gap
      0  00:10:00/slot1    1
      1  00:10:00/slot1    0
      2  00:10:00/slot1    1
      3  00:10:00/slot1    0
      4  00:20:00/slot2    0
      5  00:20:00/slot2    1
      6  00:20:00/slot2    0
      7  00:20:00/slot2    0
      
      df = df.groupby('time', as_index=False)['gap'].sum()
      print (df) 
                   time  gap
      0  00:10:00/slot1    2
      1  00:20:00/slot2    1
      

      If you want to keep the original times, assign df['timeNew'] = answerlist instead and filter from there.

      df['timeNew'] = answerlist
      print (df)
             time  gap         timeNew
      0  00:00:08    1  00:10:00/slot1
      1  00:00:48    0  00:10:00/slot1
      2  00:02:50    1  00:10:00/slot1
      3  00:00:52    0  00:10:00/slot1
      4  00:10:01    0  00:20:00/slot2
      5  00:10:03    1  00:20:00/slot2
      6  00:10:05    0  00:20:00/slot2
      7  00:10:08    0  00:20:00/slot2
      
      #Use transform here to keep every original row
      df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform('sum')
      print (df) 
             time  gap         timeNew  aggregate sum of gap
      0  00:00:08    1  00:10:00/slot1                     2
      1  00:00:48    0  00:10:00/slot1                     2
      2  00:02:50    1  00:10:00/slot1                     2
      3  00:00:52    0  00:10:00/slot1                     2
      4  00:10:01    0  00:20:00/slot2                     1
      5  00:10:03    1  00:20:00/slot2                     1
      6  00:10:05    0  00:20:00/slot2                     1
      7  00:10:08    0  00:20:00/slot2                     1
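
With 144 slots, writing the condition and choice lists by hand is impractical, but they can be generated. The following is a sketch, assuming ten-minute slots labelled by their end time as in the answer above. It relies on zero-padded HH:MM:SS strings sorting in chronological order, and the names ends and gap_sum are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
})

# Slot end times as strings: 00:10:00, 00:20:00, ..., 24:00:00 (144 values).
ends = [f'{m // 60:02d}:{m % 60:02d}:00' for m in range(10, 24 * 60 + 1, 10)]

# One condition and one label per slot, in chronological order.
condlist = [df['time'] < end for end in ends]
choicelist = [f'{end}/slot{i}' for i, end in enumerate(ends, start=1)]

# np.select uses the first condition that matches each row, so the ordering
# of the lists matters; 'unknown' marks rows that match no condition.
df['timeNew'] = np.select(condlist, choicelist, default='unknown')
df['gap_sum'] = df.groupby('timeNew')['gap'].transform('sum')
print(df)
```

Because every row is tested against every condition, this costs more memory than flooring or searchsorted on large frames, but it stays close to the approach in this answer.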
      

      【Comments】:
