Answers: split a dataframe

【Question title】: split a dataframe
【Posted】: 2020-02-23 16:51:07
【Question description】:

print(df)

【Question discussion】:

Tags: pandas dataframe split threshold

【Solution 1】:

You can use pd.cut on the cumulative sum of column B:

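import pandas as pd

th = 50

# find the cumulative sum of B
cumsum = df.B.cumsum()

# create the bins with spacing of th (threshold)
# note: cumulative sums above the last bin edge fall outside every bin (NaN),
# and groupby drops those rows by default
bins = list(range(0, cumsum.max() + 1, th))

# group by (split by) the bins
groups = pd.cut(cumsum, bins)

for key, group in df.groupby(groups):
    print(group)
    print()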

Output:

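【Discussion】:

Hi Daniel, do you know how I can give the groups different names so that I can refer to them later?
You can put them in a dictionary.
@DanielMesejo You might be interested in the timings in my answer. I didn't expect a for loop with numba to be this fast.
@Erfan It is indeed much faster; I guess numba is a really great tool.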
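
The dictionary suggestion from the discussion above might look like the following minimal sketch. The sample data, the `named_groups` dictionary, and the `group_N` key names are assumptions made for illustration, since the question's actual `df` is not shown.

```python
import pandas as pd

# Hypothetical sample data; the question's actual df is not shown.
df = pd.DataFrame({"A": list("abcdef"), "B": [10, 30, 10, 40, 20, 40]})

th = 50
cumsum = df["B"].cumsum()                    # 10, 40, 50, 90, 110, 150
bins = list(range(0, cumsum.max() + 1, th))  # [0, 50, 100, 150]
groups = pd.cut(cumsum, bins)

# Store each non-empty group under a name of our choosing (hypothetical keys).
named_groups = {
    f"group_{i}": group
    for i, (_, group) in enumerate(df.groupby(groups, observed=True), start=1)
}

# Look a group up by name later on.
print(named_groups["group_1"])
```

Any hashable key works here; the interval labels that `pd.cut` produces could also be used directly as keys instead of the made-up `group_N` names.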

【Solution 2】:

Here is an approach that uses numba to speed up our for loop:

We check when the limit is reached, then reset the total count and assign a new group:

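import numpy as np
from numba import njit

@njit
def cumsum_reset(array, limit):
    total = 0    # running sum within the current group
    counter = 0  # current group label
    groups = np.empty(array.shape[0])
    for idx, i in enumerate(array):
        total += i
        # note: at idx == 0, array[idx-1] reads the last element
        if total >= limit or array[idx-1] == limit:
            # limit reached: start a new group and reset the running sum
            counter += 1
            groups[idx] = counter
            total = 0
        else:
            groups[idx] = counter

    return groups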

grps = cumsum_reset(df['B'].to_numpy(), 50)

for _, grp in df.groupby(grps):
    print(grp, '\n')

Output:

Timings:

# create dataframe of 600k rows
dfbig = pd.concat([df]*100000, ignore_index=True)
dfbig.shape

(600000, 2)

# Erfan
%%timeit
cumsum_reset(dfbig['B'].to_numpy(), 50)

4.25 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Daniel Mesejo
def daniel_mesejo(th, column):
    cumsum = column.cumsum()
    bins = list(range(0, cumsum.max() + 1, th))
    groups = pd.cut(cumsum, bins)
    
    return groups

%%timeit
daniel_mesejo(50, dfbig['B'])

10.3 s ± 2.17 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: the numba for loop is about 2,400 times faster (10.3 s versus 4.25 ms).

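【Discussion】:

If you use a numpy array for groups instead of a list, you can get a big speedup in the Numba function.
I tried that, but got the error "type of variable cannot be determined". @max9111
It should be possible to allocate an array with groups = np.empty(array.shape[0],dtype=np.uint64) instead of groups = [], and to write the results into the array with groups[idx]=counter instead of groups.append(counter).
I see, that does work; I will edit the answer. I had tried groups=np.array([]) followed by groups = np.append(groups, counter), which gave me an error. @max9111
Hi @Erfan, I appreciate your answer, but I need to always split the data into 6 bins based on the threshold. I tried editing your code, but it doesn't work: if total >= 50 or array[idx-1] == 50 and goups==3: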
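
For reference, the preallocation advice given earlier in this discussion can be sketched as below. This is an illustrative variant, not the answer's exact code: the name `cumsum_reset_prealloc` and the sample values are hypothetical, the loop keeps only the `total >= limit` check for brevity, and `np.int64` is used where the comment suggested `np.uint64` (any fixed integer dtype serves the same purpose). The `try`/`except` fallback is also an assumption, added so the sketch still runs as plain Python when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba, run as plain Python.
    def njit(func):
        return func


@njit
def cumsum_reset_prealloc(array, limit):
    # Preallocate the output with an explicit dtype, so the compiler knows
    # the element type up front (an empty list `[]` gives it nothing to infer).
    groups = np.empty(array.shape[0], dtype=np.int64)
    total = 0
    counter = 0
    for idx in range(array.shape[0]):
        total += array[idx]
        if total >= limit:
            # Limit reached: this row starts a new group, running sum resets.
            counter += 1
            total = 0
        groups[idx] = counter
    return groups


values = np.array([10, 30, 10, 40, 20, 40])  # hypothetical data
print(cumsum_reset_prealloc(values, 50))
```

Writing into a preallocated array by index avoids both the untyped empty list and repeated `np.append` calls, each of which would allocate a new array.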