How do I make bins of equal number of observations in a pandas dataframe? Answers

【Question title】: How do I make bins of equal number of observations in a pandas dataframe?
【Posted】: 2021-05-07 09:03:07
【Question】:

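I'm trying to make a column in a dataframe that describes the group or bin an observation belongs to. The idea is to sort the dataframe by some column, then build another column that indicates which bin each observation falls into. If I want deciles, I should be able to tell a function that I want 10 equal (or close to equal) groups.

I tried pandas qcut, but that just gives tuples of the upper and lower limits of the bins. I only want 1, 2, 3, 4, ... etc. Take the following as an example: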

import numpy as np
import pandas as pd

x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))

df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)

This gives a df of 14 observations. How can I split it into 5 equal bins?

The desired result is the following:

        x         y  group
0       1  0.926273      1
1       2  0.678101      1
2       3  0.636875      1
3       4  0.802590      2
4       5  0.494553      2
5       6  0.874876      2
6       7  0.607902      3
7       8  0.028737      3
8       5  0.493545      3
9      45  0.498140      4
10  64545  0.938377      4
11     65  0.613015      4
12   6456  0.288266      5
13    564  0.917817      5

【Comments】:

Tags: python pandas

【Solution 1】:

Group every N rows together, then number the groups with ngroup:

df['group'] = df.groupby(np.arange(len(df.index)) // 3).ngroup() + 1



        x         y  group
0       1  0.548801      1
1       2  0.096620      1
2       3  0.713771      1
3       4  0.922987      2
4       5  0.283689      2
5       6  0.807755      2
6       7  0.592864      3
7       8  0.670315      3
8       5  0.034549      3
9      45  0.355274      4
10  64545  0.239373      4
11     65  0.156208      4
12   6456  0.419990      5
13    564  0.248278      5
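
The divisor 3 above is hard-coded and only happens to suit 14 rows and 5 groups. As a minimal sketch, assuming the rows are already in the desired order, the labels can instead be derived from a requested bin count by scaling each row position. The names `labels_for` and `bins` are introduced here for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd


def labels_for(n_rows, bins):
    """Return 1-based group labels whose group sizes differ by at most one."""
    # Integer arithmetic maps positions 0..n_rows-1 onto 0..bins-1.
    return np.arange(n_rows) * bins // n_rows + 1


x = [1, 2, 3, 4, 5, 6, 7, 8, 5, 45, 64545, 65, 6456, 564]
df = pd.DataFrame({'x': x, 'y': np.random.rand(len(x))})

bins = 5
df['group'] = labels_for(len(df), bins)
print(df['group'].value_counts().sort_index().tolist())  # [3, 3, 3, 3, 2]
```

Because the label depends only on the row position, the same call works when the number of rows is not known in advance.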

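【Comments】:

Nice, +1, much easier!
Thanks. Is there logic to choose the number of bins rather than telling Python how many to divide by? This could be dynamic, and I don't know how many rows I will have at any given time.
@Jordan, I don't know. It would be easier if they were distinct groups for each observation or signal.

【Solution 2】:

Another option is to generate the list of group indices from near_split: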

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)

Output:

        x         y  group
0       1  0.313614      1
1       2  0.765079      1
2       3  0.153851      1
3       4  0.792098      2
4       5  0.123700      2
5       6  0.239107      2
6       7  0.133665      3
7       8  0.979318      3
8       5  0.781948      3
9      45  0.264344      4
10  64545  0.495561      4
11     65  0.504734      4
12   6456  0.766627      5
13    564  0.428423      5
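
To make the behaviour of near_split concrete, here is a small check. It is a sketch that repeats the function from this answer so that it runs on its own; the sample inputs are chosen only for illustration. It shows that the bin sizes sum to the row count and differ by at most one, with the larger bins placed first.

```python
def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


print(near_split(14, 5))  # [3, 3, 3, 3, 2]
print(near_split(10, 4))  # [3, 3, 2, 2]
print(near_split(3, 5))   # [1, 1, 1, 0, 0] (more bins than rows leaves empty bins)

# The sizes always add up to the row count and differ by at most one.
for n in range(50):
    for b in range(1, 10):
        sizes = near_split(n, b)
        assert sum(sizes) == n
        assert max(sizes) - min(sizes) <= 1
```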

【Comments】:

【Solution 3】:

You can split the dataframe evenly with np.array_split(), assign the groups, then recombine with pd.concat():

bins = 5
splits = np.array_split(df, bins)

for i in range(len(splits)):
    splits[i]['group'] = i + 1

df = pd.concat(splits)

Or as a one-liner using assign():

df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])

        x         y  group
0       1  0.145781      1
1       2  0.262097      1
2       3  0.114799      1
3       4  0.275054      2
4       5  0.841606      2
5       6  0.187210      2
6       7  0.582487      3
7       8  0.019881      3
8       5  0.847115      3
9      45  0.755606      4
10  64545  0.196705      4
11     65  0.688639      4
12   6456  0.275884      5
13    564  0.579946      5
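
Depending on the pandas and NumPy versions installed, passing a DataFrame directly to np.array_split, or assigning a column to each piece in a loop, may emit warnings. A hedged alternative, shown here as a sketch and not as the answer's own method, is to split the row positions instead and write the labels into a single array. The variable names `group` and `positions` are illustrative.

```python
import numpy as np
import pandas as pd

x = [1, 2, 3, 4, 5, 6, 7, 8, 5, 45, 64545, 65, 6456, 564]
df = pd.DataFrame({'x': x, 'y': np.random.rand(len(x))})

bins = 5
group = np.empty(len(df), dtype=int)

# Split the row positions 0..len(df)-1 rather than the DataFrame itself.
for i, positions in enumerate(np.array_split(np.arange(len(df)), bins)):
    group[positions] = i + 1

df['group'] = group
print(df)
```

This keeps the original dataframe intact and avoids the concat step, at the cost of a short loop over the bins.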

【Comments】:

【Solution 4】:

Here is a way to "manually" compute the bin ranges from the requested number of bins, bins:

bins = 5

l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1

df['group'] = group

Result:

        x         y  group
0       1  0.205168      1
1       2  0.105466      1
2       3  0.545794      1
3       4  0.639346      2
4       5  0.758056      2
5       6  0.982090      2
6       7  0.942849      3
7       8  0.284520      3
8       5  0.491151      3
9      45  0.731265      4
10  64545  0.072668      4
11     65  0.601416      4
12   6456  0.239454      5
13    564  0.345006      5

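This appears to follow the splitting logic of np.array_split (i.e. it tries to split the bins evenly, but adds to the earlier bins when that is not possible).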
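
The claim about np.array_split can be checked empirically. The following sketch compares the bin sizes produced by the repeat-based computation with those of np.array_split over a range of row counts and bin counts. The helper names `repeat_sizes` and `array_split_sizes` are made up for this check.

```python
import numpy as np


def repeat_sizes(n, bins):
    """Bin sizes as computed in this answer: the first n % bins bins get one extra row."""
    sizes = np.repeat(n // bins, bins)
    sizes[: n % bins] += 1
    return sizes.tolist()


def array_split_sizes(n, bins):
    """Bin sizes produced by np.array_split on n row positions."""
    return [len(part) for part in np.array_split(np.arange(n), bins)]


mismatches = [
    (n, b)
    for n in range(200)
    for b in range(1, 20)
    if repeat_sizes(n, b) != array_split_sizes(n, b)
]
print(mismatches)  # [] means both approaches agree on every tested case
```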

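Although the code is less concise, it uses no loops, so in theory it should be faster on large amounts of data.

Purely out of curiosity, I'm leaving this perfplot benchmark here: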

import numpy as np
import pandas as pd
import perfplot

def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)

    return df

def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1

    return group

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

def array_split(df, bins=5):
    splits = np.array_split(df, bins)

    for i in range(len(splits)):
        splits[i]['group'] = i + 1

    return pd.concat(splits)

perfplot.show(
    setup = lambda n : make_data(n),
    kernels = [
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index))//3).ngroup()+1,
        lambda df: array_split(df)
        ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)

【Comments】:

I was curious about the timings too, so thanks for the perfplot! Nice repeat solution as well, +1