How to aggregate values in a column between values of another column in pandas

[Question title]: How to aggregate values in a column between values of another column in pandas
[Posted]: 2020-02-05 07:41:06
[Question]:

I have two dataframes that I want to merge. They look like this:

df_1
unit   start_time   stop_time
A        0.0          1.2
B        1.3          4.1
A        4.2          4.5
B        4.6          7.2
A        7.3          8.0

df_2
time    other_data
0.2       .0122
0.4       .0128
0.6       .0101
0.8       .0091
1.0       .2122
1.2       .1542
1.4       .1546
1.6       .1522
1.8       .2542
2.0       .1557
2.2       .2542
2.4       .1543
2.6       .0121
2.8       .0111
3.0       .0412
3.2       .0214
3.4       .0155
3.6       .0159
3.8       .0154
4.0       .0155
4.2       .0211
4.4       .0265
4.6       .0146
4.8       .0112
5.0       .0166
5.2       .0101
5.4       .0132
5.6       .0112
5.8       .0121
6.0       .0142
6.2       .0124
6.4       .0111
6.6       .0123
6.8       .0111
6.0       .0119
6.2       .0112
6.4       .0131
6.6       .0117
6.8       .0172
7.0       .0123
7.2       .0127
7.4       .0121
7.6       .0110
7.8       .0120
8.0       .0121

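I want to merge these dataframes using the following criteria:

Step 1

I want to group all values in df_2.other_data where df_2.time falls between df_1.start_time and df_1.stop_time. For example, for the first row of df_1, the following data from df_2 would be grouped: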

time    other_data
0.2       .0122
0.4       .0128
0.6       .0101
0.8       .0091
1.0       .2122
1.2       .1542

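Step 2

Within that group, I want to count the total number of observations where df_2.other_data is above a threshold; in this example the threshold is set to 0.0120. The total number of observations above this threshold in the group is 4. This is the value I want to merge onto df_1. The result should look like this: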

unit   start_time   stop_time   other_data_above_threshold
A        0.0          1.2             4
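
The two steps for the first row of df_1 can be sketched as follows. This is a minimal illustration, assuming the inclusive bounds and the strict "above 0.0120" comparison described in the question; the name first_rows is made up for this example and holds only the first seven rows of df_2.

```python
import pandas as pd

# First seven rows of df_2 as posted (the 1.4 row lies outside the first interval)
first_rows = pd.DataFrame({
    "time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4],
    "other_data": [0.0122, 0.0128, 0.0101, 0.0091, 0.2122, 0.1542, 0.1546],
})

# Step 1: keep the rows whose time lies in [0.0, 1.2] (between is inclusive by default)
group = first_rows[first_rows["time"].between(0.0, 1.2)]

# Step 2: count the observations strictly above the threshold
count = int((group["other_data"] > 0.012).sum())
print(len(group), count)
```

The group holds six rows and four of them exceed the threshold, which matches the value 4 in the table above.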

The final dataframe should look like this:

unit   start_time   stop_time   other_data_above_threshold
A        0.0          1.2              4
B        1.3          4.1              13
A        4.2          4.5              3
B        4.6          7.2              11
A        7.3          8.0              4

[Comments]:

Tags: python pandas

[Answer 1]:

IIUC, this is what you need.

# For each row of df_1, count the df_2 rows inside [start_time, stop_time] with other_data >= 0.012
df_1['other_data_at'] = df_1.apply(lambda x: ((df_2['time'] >= x['start_time']) & (df_2['time'] <= x['stop_time']) & (df_2['other_data'] >= 0.012)).sum(), axis=1)

Output

   unit start_time  stop_time   other_data_at
0   A   0.0              1.2    4
1   B   1.3              4.1    13
2   A   4.2              4.5    2 # your expected output shows 3, but it should be 2
3   B   4.6              7.2    11
4   A   7.3              8.0    3
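
As a cross-check on these counts, here is a hedged, self-contained sketch that rebuilds both frames from the question and replaces the row-wise apply with NumPy broadcasting. It assumes the same inclusive bounds and the same >= 0.012 comparison as the one-liner above; the intermediate names (t, above, in_window, counts) are illustrative only.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "unit": ["A", "B", "A", "B", "A"],
    "start_time": [0.0, 1.3, 4.2, 4.6, 7.3],
    "stop_time": [1.2, 4.1, 4.5, 7.2, 8.0],
})

# df_2 exactly as posted, including the repeated 6.0-6.8 time values
df_2 = pd.DataFrame({
    "time": [
        0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4,
        2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 4.2, 4.4, 4.6, 4.8,
        5.0, 5.2, 5.4, 5.6, 5.8, 6.0, 6.2, 6.4, 6.6, 6.8, 6.0, 6.2,
        6.4, 6.6, 6.8, 7.0, 7.2, 7.4, 7.6, 7.8, 8.0,
    ],
    "other_data": [
        .0122, .0128, .0101, .0091, .2122, .1542, .1546, .1522, .2542,
        .1557, .2542, .1543, .0121, .0111, .0412, .0214, .0155, .0159,
        .0154, .0155, .0211, .0265, .0146, .0112, .0166, .0101, .0132,
        .0112, .0121, .0142, .0124, .0111, .0123, .0111, .0119, .0112,
        .0131, .0117, .0172, .0123, .0127, .0121, .0110, .0120, .0121,
    ],
})

t = df_2["time"].to_numpy()                        # shape (45,)
above = df_2["other_data"].to_numpy() >= 0.012     # shape (45,)
start = df_1["start_time"].to_numpy()[:, None]     # shape (5, 1)
stop = df_1["stop_time"].to_numpy()[:, None]       # shape (5, 1)

# Row i of in_window marks the df_2 rows that fall inside interval i
in_window = (t >= start) & (t <= stop)             # shape (5, 45)
counts = (in_window & above).sum(axis=1)

df_1["other_data_at"] = counts
print(df_1)
```

The broadcast builds a 5 x 45 boolean matrix, so memory grows with the product of the two frame lengths; for frames of this size that cost is negligible.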

[Comments]:

[Answer 2]:

Hi, I would try iterating over your df1 and using its values to filter df2.

It would look something like this:

def my_counting(df1, df2, threshold):
  count_list = []
  # For each interval in df1, select the df2 rows whose time falls inside it
  for start, stop in zip(df1['start_time'], df1['stop_time']):
    in_window = df2[(df2['time'] >= start) & (df2['time'] <= stop)]
    # Count the rows whose other_data is above the threshold
    count_list.append(in_window[in_window['other_data'] > threshold].shape[0])

  df1['other_data_above_threshold'] = count_list
  return df1

print(my_counting(df_1, df_2, 0.012))

[Comments]:

[Answer 3]:

You can try pd.cut

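import numpy as np
import pandas as pd

# Bin edges are the interval start times; np.inf closes the last bin
a = df_1.start_time.to_list() + [np.inf]
s = pd.cut(df_2.time, bins=a, labels=df_1.index, right=False)
df_1['other_data_above_threshold'] = df_2.other_data.gt(0.012).groupby(s).sum()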

Out[213]:
  unit  start_time  stop_time  other_data_above_threshold
0    A         0.0        1.2                         4.0
1    B         1.3        4.1                        13.0
2    A         4.2        4.5                         2.0
3    B         4.6        7.2                        11.0
4    A         7.3        8.0                         2.0
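
Note that the pd.cut bins are built from start_time only, so any time falling between one stop_time and the next start_time would be counted in the earlier interval. If stop_time should be honored as well, one possible variant (a sketch, not part of the original answer) is pd.IntervalIndex with closed="both". The example below uses only the first two rows of df_1 and the first eight rows of df_2 to stay short; pos, matched and counts are illustrative names.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "unit": ["A", "B"],
    "start_time": [0.0, 1.3],
    "stop_time": [1.2, 4.1],
})
df_2 = pd.DataFrame({
    "time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6],
    "other_data": [.0122, .0128, .0101, .0091, .2122, .1542, .1546, .1522],
})

# One closed interval [start_time, stop_time] per row of df_1
intervals = pd.IntervalIndex.from_arrays(
    df_1["start_time"], df_1["stop_time"], closed="both"
)

# Position of the interval containing each time; -1 when no interval contains it
pos = intervals.get_indexer(df_2["time"].to_numpy())
matched = pos >= 0

# Sum the boolean "above threshold" flags per interval position
above = df_2["other_data"].gt(0.012)
counts = above[matched].groupby(pos[matched]).sum()

# Intervals with no matching rows get a count of 0
df_1["other_data_above_threshold"] = (
    counts.reindex(range(len(df_1)), fill_value=0).to_numpy()
)
print(df_1)
```

get_indexer requires the intervals to be non-overlapping, which holds for the start and stop times in the question.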

[Comments]: