Using the pandas cut function with groupby and group-specific bins: answers

【Question title】: Using pandas cut function with groupby and group-specific bins
【Posted】: 2020-09-28 18:05:52
【Question description】:

I have the following sample DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Tag': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
                   'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18, 21, 1, 2], 
                   'Value': [1, 13, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

I add the percentage rank of Value using

df['Percent_value'] = df['Value'].rank(method='dense', pct=True)

and add Order using pd.cut() with predefined percentage bins:

percentage = np.array([10, 20, 50, 70, 100])/100

df['Order'] = pd.cut(df['Percent_value'], bins=np.insert(percentage, 0, 0), labels = [1,2,3,4,5])

which gives:

    Tag  ID  Value  Percent_value  Order
0     A  11      1       0.076923      1
1     A  12     13       1.000000      5
2     A  16     11       0.846154      5
3     B  19     12       0.923077      5
4     B  14      2       0.153846      2
5     B   9      3       0.230769      3
6     B   4      4       0.307692      3
7     C  13      5       0.384615      3
8     C   6      6       0.461538      3
9     C  18      7       0.538462      4
10    C  21      8       0.615385      4
11    C   1      9       0.692308      4
12    C   2     10       0.769231      5
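
For readers unfamiliar with how the bin edges map to Order, here is a minimal sketch. By default pd.cut uses right-closed intervals, so with edges 0, 0.1, 0.2, 0.5, 0.7, 1.0 a value of exactly 1.0 falls into the last bin, while a value of exactly 0 would fall outside every bin and come back as NaN (that does not occur here, because a percentage rank is always greater than 0). The names edges, sample, intervals and orders are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

percentage = np.array([10, 20, 50, 70, 100]) / 100
edges = np.insert(percentage, 0, 0)  # prepend 0 as the left-most edge

# Three Percent_value entries from the table above: rows 0, 4 and 1.
sample = pd.Series([1 / 13, 2 / 13, 13 / 13])

# Without labels, pd.cut returns the interval each value falls into.
intervals = pd.cut(sample, bins=edges)

# With labels, each interval is replaced by the label at the same position.
orders = pd.cut(sample, bins=edges, labels=[1, 2, 3, 4, 5])

print(intervals)
print(orders.astype(int).tolist())  # [1, 2, 5]
```

The labels list must have exactly one entry per interval, i.e. one fewer than the number of edges.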

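My question

Now, instead of a single percentage array (bins) for all tags (groups), I have a separate percentage array for each Tag group, i.e. A, B and C. How can I apply df.groupby('Tag') and then pd.cut(), using a different set of percentage bins for each group, taken from the dictionary below? Is there a direct way that avoids a for loop like the one in my attempt further down?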

percentages = {'A': np.array([10, 20, 50, 70, 100])/100,
               'B': np.array([20, 40, 60, 90, 100])/100,
               'C': np.array([30, 50, 60, 80, 100])/100}

Desired result (note: Order is now computed with different bins for each Tag):

    Tag  ID  Value  Percent_value  Order
0     A  11      1       0.076923      1
1     A  12     13       1.000000      5
2     A  16     11       0.846154      5
3     B  19     12       0.923077      5
4     B  14      2       0.153846      1
5     B   9      3       0.230769      2
6     B   4      4       0.307692      2
7     C  13      5       0.384615      2
8     C   6      6       0.461538      2
9     C  18      7       0.538462      3
10    C  21      8       0.615385      4
11    C   1      9       0.692308      4
12    C   2     10       0.769231      4

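My attempt

orders = []

# Group by the scalar key 'Tag' so that k is the tag itself ('A', 'B', 'C');
# grouping by the list ['Tag'] yields 1-tuples such as ('A',) in pandas 2.x,
# which would not match the keys of the percentages dictionary.
for k, g in df.groupby('Tag'):
    percentage = percentages[k]
    # assign() returns a new frame instead of writing into the group slice
    g = g.assign(Order=pd.cut(g['Percent_value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5]))
    orders.append(g)

df_final = pd.concat(orders, axis=0, join='outer')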

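【Comments on the question】:

Tags: python pandas dataframe group-by

【Solution 1】:

You can apply pd.cut inside the groupby; within apply, x.name holds the key of the current group, so it can be used to look up that group's bins: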

df['Order'] = df.groupby('Tag').apply(lambda x: pd.cut(x['Percent_value'], bins=np.insert(percentages[x.name],0,0), labels=[1,2,3,4,5])).reset_index(drop = True)


    Tag  ID  Value  Percent_value  Order
0     A  11      1       0.076923      1
1     A  12     13       1.000000      5
2     A  16     11       0.846154      5
3     B  19     12       0.923077      5
4     B  14      2       0.153846      1
5     B   9      3       0.230769      2
6     B   4      4       0.307692      2
7     C  13      5       0.384615      2
8     C   6      6       0.461538      2
9     C  18      7       0.538462      3
10    C  21      8       0.615385      4
11    C   1      9       0.692308      4
12    C   2     10       0.769231      4
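
One caveat about the one-liner: reset_index(drop=True) renumbers the result 0, 1, 2, ... in group order, so it lines up with df only when the rows are already sorted by Tag and df has the default integer index, as in the sample data. The sketch below is a variant under that observation, not the answerer's code: it selects the Percent_value column before apply and passes group_keys=False so the result keeps the original row index and the assignment aligns by index. The helper name cut_group and the shuffled copy are hypothetical, introduced only to show that row order no longer matters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Tag': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
                   'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18, 21, 1, 2],
                   'Value': [1, 13, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['Percent_value'] = df['Value'].rank(method='dense', pct=True)

percentages = {'A': np.array([10, 20, 50, 70, 100]) / 100,
               'B': np.array([20, 40, 60, 90, 100]) / 100,
               'C': np.array([30, 50, 60, 80, 100]) / 100}


def cut_group(s):
    """Bin one group's Percent_value with that group's own edges.

    Inside groupby(...).apply, s.name is the group key (the Tag).
    """
    edges = np.insert(percentages[s.name], 0, 0)
    return pd.cut(s, bins=edges, labels=[1, 2, 3, 4, 5])


# Shuffle the rows to show that the result does not depend on row order.
shuffled = df.sample(frac=1, random_state=0)

# group_keys=False keeps the original index, so assignment aligns row by row.
shuffled['Order'] = (
    shuffled.groupby('Tag', group_keys=False)['Percent_value'].apply(cut_group)
)

result = shuffled.sort_index()
print(result)
```

After sorting back by index, the Order column should match the desired result in the question. This still calls pd.cut once per group internally; it only removes the explicit Python loop and the manual concat.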

【Discussion】: