Splitting a big dataset into smaller groups: answers

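【Question title】: Splitting a big dataset into smaller groups
【Posted】: 2021-03-28 12:13:33
【Question description】:

I am trying to split a large dataset of 1.7 million rows in 3 columns and assign a group number to every 2500 rows, so that I can analyze each group separately and compare it with the others. For example, I want to calculate the RMS of each group and plot the results at the end to see how they behave. I used the following code, which I found on the forum, but it does not work for me: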

df1 = pd.concat(Combined)
group_size = 2500
numbers = list(range(len(df1.index) // group_size)) * group_size
numbers.sort()
numbers = pd.Series(numbers)
# I think this line is the problem in my code; it gives me
# ValueError: Shape of passed values is (3686225, 4), indices imply (2367075, 4)
DF_1 = pd.concat([df1, numbers], axis=1)
DF_1.columns = ['X','Y','Z','group number']
groups = DF_1.groupby('group number').filter(lambda x: len(x) == group_size)
print(groups)

Tags: python pandas python-requests dataset

【Solution 1】:

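import math
import pandas as pd

def index_marks(nrows, chunk_size):
    # Row positions at which each new chunk (after the first) begins.
    return range(chunk_size, math.ceil(nrows / chunk_size) * chunk_size, chunk_size)

def split(dfm, chunk_size):
    # Slice by position with iloc; the final chunk may hold fewer than chunk_size rows.
    # (np.split on a DataFrame triggers a deprecation warning in recent pandas.)
    starts = [0, *index_marks(dfm.shape[0], chunk_size)]
    return [dfm.iloc[start:start + chunk_size] for start in starts]

# Combined is the list of DataFrames from the question.
df1 = pd.concat(Combined)
group_size = 2500

chunks = split(df1, group_size)
for c in chunks:
    print("Shape: {}; {}".format(c.shape, c.index))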

You can then operate on each chunk c and continue from there.
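If a group-number column is preferred over a list of chunks, as the question originally asked, the labels can be derived from row positions instead of being concatenated as a separate Series. `pd.concat(..., axis=1)` aligns rows by index label rather than by position, so an index left over from concatenating several frames may not line up with a freshly built Series; that is a likely, though unconfirmed, cause of the shape mismatch reported above. The following is a minimal sketch, not a definitive implementation. The helper names `add_group_numbers` and `rms_per_group` and the small `demo` frame are hypothetical; the column names `X`, `Y`, `Z` and `group number` are taken from the question.

```python
import numpy as np
import pandas as pd

def add_group_numbers(df, group_size):
    """Return a copy of df with a fresh 0..n-1 index and a 'group number' column."""
    out = df.reset_index(drop=True)
    # Integer division of the row position gives 0 for the first group_size rows,
    # 1 for the next group_size rows, and so on.
    out["group number"] = np.arange(len(out)) // group_size
    return out

def rms_per_group(df, group_size, value_cols=("X", "Y", "Z")):
    """Root mean square of each value column within each full-size group."""
    labelled = add_group_numbers(df, group_size)
    # Drop a trailing partial group, mirroring the filter() call in the question.
    full = labelled.groupby("group number").filter(lambda g: len(g) == group_size)
    squared = full[list(value_cols)] ** 2
    return np.sqrt(squared.groupby(full["group number"]).mean())

# Small synthetic stand-in for df1 (hypothetical data: 10 rows, groups of 4).
demo = pd.DataFrame({"X": np.arange(10.0), "Y": np.ones(10), "Z": np.zeros(10)})
rms = rms_per_group(demo, group_size=4)
print(rms)
```

With the real data, the call would presumably be `rms_per_group(df1, 2500)`, and the resulting frame (one row per group) could be passed to `rms.plot()` to compare groups, assuming matplotlib is installed.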
