Sum a column based on groupby and condition: answers

【Question title】: Sum a column based on groupby and condition
【Posted】: 2020-06-17 19:51:22
【Question】:

I have a DataFrame with several columns. I want to sum the `gap` column over the rows whose time falls within given time slots.

   region        date      time  gap
0       1  2016-01-01  00:00:08    1
1       1  2016-01-01  00:00:48    0
2       1  2016-01-01  00:02:50    1
3       1  2016-01-01  00:00:52    0
4       1  2016-01-01  00:10:01    0
5       1  2016-01-01  00:10:03    1
6       1  2016-01-01  00:10:05    0
7       1  2016-01-01  00:10:08    0

I want to sum the `gap` column. I have a dict of time slots like this:

'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'

After summing, the DataFrame above should look like this:

   region        date            time  gap
0       1  2016-01-01  00:10:00/slot1    2
1       1  2016-01-01  00:20:00/slot2    1

I have many regions and 144 time slots, from 00:00:00 to 23:59:49. I tried this:

regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()

But it does not work.

【Comments】:

Tags: python pandas dataframe

【Solution 1】:

The idea is to convert the `time` column to datetimes, floor them to 10Min, and then convert them back to HH:MM:SS strings:

d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}

df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
   region        date      time  gap
0       1  2016-01-01  00:00:00    1
1       1  2016-01-01  00:00:00    0
2       1  2016-01-01  00:00:00    1
3       1  2016-01-01  00:00:00    0
4       1  2016-01-01  00:10:00    0
5       1  2016-01-01  00:10:00    1
6       1  2016-01-01  00:10:00    0
7       1  2016-01-01  00:10:00    0

Then aggregate with sum and, as a last step, map the `time` values through the dictionary with its keys and values swapped:

regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
   region        date            time  gap
0       1  2016-01-01  00:00:00/slot1    2
1       1  2016-01-01  00:10:00/slot2    1
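The dictionary above lists only three slots, so with the 144 slots mentioned in the question every other slot would map to NaN. A minimal sketch of one way to generate the full reverse mapping is shown here. It assumes slots are numbered `slot1` to `slot144` in time order starting at 00:00:00, which the question implies but does not state, and the name `starts` is made up for illustration.

```python
import pandas as pd

# Assumption: slot1 starts at 00:00:00 and slots advance in 10-minute steps.
# The date in date_range is arbitrary; only the time of day is kept.
starts = pd.date_range("2016-01-01", periods=144, freq="10min").strftime("%H:%M:%S")
d1 = {start: f"slot{i}" for i, start in enumerate(starts, start=1)}

df = pd.DataFrame({
    "region": [1] * 8,
    "date": ["2016-01-01"] * 8,
    "time": ["00:00:08", "00:00:48", "00:02:50", "00:00:52",
             "00:10:01", "00:10:03", "00:10:05", "00:10:08"],
    "gap": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Floor each time to the start of its 10-minute slot, then aggregate as above.
df["time"] = (pd.to_datetime(df["time"], format="%H:%M:%S")
                .dt.floor("10min")
                .dt.strftime("%H:%M:%S"))
regres = df.groupby(["region", "date", "time"], as_index=False)["gap"].sum()
regres["time"] = regres["time"] + "/" + regres["time"].map(d1)
print(regres)
```

Passing an explicit `format` to `pd.to_datetime` is optional here; it only avoids format inference on time-only strings.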

If you want to display the end of each slot (the next 10Min boundary) instead:

d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}

times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
   region        date      time  gap     time1
0       1  2016-01-01  00:00:00    1  00:10:00
1       1  2016-01-01  00:00:00    0  00:10:00
2       1  2016-01-01  00:00:00    1  00:10:00
3       1  2016-01-01  00:00:00    0  00:10:00
4       1  2016-01-01  00:10:00    0  00:20:00
5       1  2016-01-01  00:10:00    1  00:20:00
6       1  2016-01-01  00:10:00    0  00:20:00
7       1  2016-01-01  00:10:00    0  00:20:00

regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
   region        date            time  gap
0       1  2016-01-01  00:10:00/slot1    2
1       1  2016-01-01  00:20:00/slot2    1

EDIT:

An improvement over flooring and converting to strings is binning with `cut` or `searchsorted`:

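import numpy as np
import pandas as pd

df['time'] = pd.to_timedelta(df['time'])

# 145 bin edges from 00:00:00 to 24:00:00, i.e. 144 ten-minute slots
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
# label each slot by its start time; str(x) looks like '0 days 00:10:00'
labels = np.array(['{}'.format(str(x)[-8:]) for x in bins])
labels = labels[:-1]

# left-closed bins [start, end), so the result matches flooring
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels, right=False)
df['time11'] = labels[np.searchsorted(bins, df['time'].values, side='right') - 1]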
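For reference, here is a minimal end-to-end sketch of the binning idea on the sample data from the question, followed by the groupby sum. It assumes left-closed bins (`right=False`) so that a time of exactly 00:10:00 lands in the slot starting at 00:10:00, as it would with flooring. The column name `slot_start` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "region": [1] * 8,
    "time": ["00:00:08", "00:00:48", "00:02:50", "00:00:52",
             "00:10:01", "00:10:03", "00:10:05", "00:10:08"],
    "gap": [1, 0, 1, 0, 0, 1, 0, 0],
})

td = pd.to_timedelta(df["time"])

# 145 edges give 144 ten-minute bins; each bin is labelled by its start time.
bins = pd.timedelta_range("00:00:00", "24:00:00", freq="10min")
labels = [str(b)[-8:] for b in bins[:-1]]

# Assumption: bins are left-closed, [start, end).
df["slot_start"] = pd.cut(td, bins=bins, labels=labels, right=False)

# observed=True keeps only the slots that actually occur in the data.
out = (df.groupby(["region", "slot_start"], observed=True)["gap"]
         .sum()
         .reset_index())
print(out)
```

Because `pd.cut` returns a categorical column, grouping without `observed=True` would emit a row for each of the 144 slots, including empty ones.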

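【Comments】:

Is this process slow? It takes too much time on data with more than 600k records.
@shahidhamdam - I need some more time, but it can probably be made faster.
@shahidhamdam - One thing - do you need the second solution or the first one?
I need the first solution, but one more thing: I also want to count the number of rows in each time slot. For example, slot1 has 4 rows. Can you help me?
@shahidhamdam - Use regres = df.groupby(['region','date','time','time1'], as_index=False).size().reset_index(name='count')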
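The one-liner in the last comment may not run unchanged on newer pandas releases, where, as far as I know, `size()` combined with `as_index=False` already returns a DataFrame and `DataFrame.reset_index` has no `name` argument. A hedged alternative is to get the sum and the row count in one pass with named aggregation. The sketch below groups by the floored `time` only, as in the first solution, and reuses the column name `count` from the comment.

```python
import pandas as pd

df = pd.DataFrame({
    "region": [1] * 8,
    "date": ["2016-01-01"] * 8,
    "time": ["00:00:08", "00:00:48", "00:02:50", "00:00:52",
             "00:10:01", "00:10:03", "00:10:05", "00:10:08"],
    "gap": [1, 0, 1, 0, 0, 1, 0, 0],
})
df["time"] = (pd.to_datetime(df["time"], format="%H:%M:%S")
                .dt.floor("10min")
                .dt.strftime("%H:%M:%S"))

# Named aggregation: output_column=(input_column, function).
# "size" counts all rows in the group, including rows where gap is NaN.
regres = df.groupby(["region", "date", "time"], as_index=False).agg(
    gap=("gap", "sum"),
    count=("gap", "size"),
)
print(regres)
```

Use `regres["count"]` rather than `regres.count` to read the new column, since `count` is also a DataFrame method.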

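【Solution 2】:

To avoid the complexity of datetime comparisons (unless that is your whole point, in which case ignore my answer) and to show the essence of this group-by-slot-window problem, I assume here that the times are integers.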

df = pd.DataFrame({'time':[8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056], 
                   'gap': [1, 0,  1,   0,  0,    1,    0,    0,    1,    1,    1]})
slots = np.array([0, 1000, 1500])
df['slot'] = df.apply(func = lambda x: slots[np.argmax(slots[x['time']>slots])], axis=1)
df.groupby('slot')[['gap']].sum()

Output

       gap
slot    
-----------
0       2
1000    1
1500    3
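The row-wise `apply` above calls a Python function once per row, which may be slow on large frames. A possible vectorized variant uses `np.searchsorted`, sketched below under two assumptions that are not part of the answer: `slots` is sorted in ascending order, and no time is smaller than `slots[0]`. Note also that `side="right"` places a time exactly equal to a slot boundary (for example 1000) in the slot that starts there, whereas the strict `>` comparison above would place it in the previous slot.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
    "gap":  [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1],
})
slots = np.array([0, 1000, 1500])  # assumed sorted ascending

# Index of the last slot start that is <= each time value.
idx = np.searchsorted(slots, df["time"].to_numpy(), side="right") - 1
df["slot"] = slots[idx]

result = df.groupby("slot")[["gap"]].sum()
print(result)
```

On the sample data this gives the same sums per slot as the `apply` version, because none of the times sits exactly on a boundary.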

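【Comments】:

【Solution 3】:

One way to approach this problem is to first convert your `time` column to the values you want, and then do a groupby sum on the `time` column.

The code below shows the approach I used. I used `np.select` so that I can include as many conditions and choices as needed. After converting `time` to the values I want, I do a simple groupby sum. There is really no need to fuss with formatting times or converting strings; just let the pandas DataFrame handle it.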

import pandas as pd
import numpy as np #This is the library you require for np.select function

#Just creating the DataFrame using a dictionary here
regdict = {
        'time': ['00:00:08','00:00:48','00:02:50','00:00:52','00:10:01','00:10:03','00:10:05','00:10:08'],
        'gap': [1,0,1,0,0,1,0,0],}

df = pd.DataFrame(regdict)

#Add in all your conditions and options here
condlist = [df['time']<'00:10:00',df['time']<'00:20:00'] 
choicelist = ['00:10:00/slot1','00:20:00/slot2'] 

#Use np.select after you have defined all your conditions and options
#(a string default keeps the result dtype consistent with the string choices)
answerlist = np.select(condlist, choicelist, default='')
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']

#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
             time  gap
0  00:10:00/slot1    1
1  00:10:00/slot1    0
2  00:10:00/slot1    1
3  00:10:00/slot1    0
4  00:20:00/slot2    0
5  00:20:00/slot2    1
6  00:20:00/slot2    0
7  00:20:00/slot2    0

df = df.groupby('time', as_index=False)['gap'].sum()
print (df) 
             time  gap
0  00:10:00/slot1    2
1  00:20:00/slot2    1

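If you want to keep the original times, assign the result to a new column with df['timeNew'] = answerlist instead of overwriting `time`, and then filter from there.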

df['timeNew'] = answerlist
print (df)
       time  gap         timeNew
0  00:00:08    1  00:10:00/slot1
1  00:00:48    0  00:10:00/slot1
2  00:02:50    1  00:10:00/slot1
3  00:00:52    0  00:10:00/slot1
4  00:10:01    0  00:20:00/slot2
5  00:10:03    1  00:20:00/slot2
6  00:10:05    0  00:20:00/slot2
7  00:10:08    0  00:20:00/slot2

#Use transform function here to retain all prior values
#(the string 'sum' avoids the deprecation warning newer pandas emits for the builtin sum)
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform('sum')
print (df) 
       time  gap         timeNew  aggregate sum of gap
0  00:00:08    1  00:10:00/slot1                     2
1  00:00:48    0  00:10:00/slot1                     2
2  00:02:50    1  00:10:00/slot1                     2
3  00:00:52    0  00:10:00/slot1                     2
4  00:10:01    0  00:20:00/slot2                     1
5  00:10:03    1  00:20:00/slot2                     1
6  00:10:05    0  00:20:00/slot2                     1
7  00:10:08    0  00:20:00/slot2                     1
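With 144 slots, writing every condition and choice by hand is impractical. The sketch below generates both lists with comprehensions. It assumes the same labelling as in this answer (upper bound of the slot, then `/slotN`) and relies on zero-padded HH:MM:SS strings comparing correctly as plain strings. The name `bounds` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": ["00:00:08", "00:00:48", "00:02:50", "00:00:52",
             "00:10:01", "00:10:03", "00:10:05", "00:10:08"],
    "gap": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Upper bounds '00:10:00', '00:20:00', ..., '24:00:00' (144 values).
bounds = [f"{m // 60:02d}:{m % 60:02d}:00" for m in range(10, 24 * 60 + 1, 10)]

# np.select takes the first condition that is True for each row,
# so the conditions must be listed in ascending order of the bound.
condlist = [df["time"] < b for b in bounds]
choicelist = [f"{b}/slot{i}" for i, b in enumerate(bounds, start=1)]

df["timeNew"] = np.select(condlist, choicelist, default="")
out = df.groupby("timeNew", as_index=False)["gap"].sum()
print(out)
```

This evaluates 144 comparisons over the whole column, so on very large frames a binning approach such as `pd.cut` is likely to be cheaper.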

【Comments】: