【Question title】: Split numpy array into chunks by maximum size
【Posted】: 2018-10-22 16:10:15
【Question description】:

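I have some very large two-dimensional numpy arrays. One data set is 55732 by 257659, which is over 14 billion elements. Because some operations I need to perform throw MemoryErrors, I want to try splitting the arrays into chunks of a certain size and running the operations against the chunks. (I can aggregate the results after the operation has run on each piece.) Since my problem is MemoryErrors, it is important that I can cap the size of the arrays somehow, rather than split them into a constant number of pieces.

For an example, let's generate a 1009 by 1009 random array: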

import numpy
a = numpy.random.choice([1,2,3,4], (1009,1009))

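My data will not necessarily divide evenly, and it is definitely not guaranteed to be splittable by the size I want. So I chose 1009 because it's prime.

Let's also say I want chunks no larger than 50 by 50. Since this is only to avoid errors with extremely large arrays, it's fine if the resulting sizes are not exact.

How can I split this into the desired chunks?

I'm using Python 3.6 64-bit with numpy 1.14.3 (latest).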

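【Solution 1】:

Slightly less than max

The logic

First, store the shape of the final chunk size, along each dimension you want to split, in a tuple: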

chunk_shape = (50, 50)

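array_split only splits along one axis (or dimension) of an array at a time. So let's start with just the first axis.

Compute the number of sections we need to split the array into: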
```
import math  # needed for math.ceil
num_sections = math.ceil(a.shape[0] / chunk_shape[0])
```
In our example, this is 21 (1009 / 50 = 20.18, rounded up).

Now split it:

first_split = numpy.array_split(a, num_sections, axis=0)

This gives us a list of 21 (the number of requested sections) numpy arrays, split so that none is larger than 50 in the first dimension:

print(len(first_split))
# 21
print({i.shape for i in first_split})
# {(48, 1009), (49, 1009)}
# These are the distinct shapes, so we don't see all 21 separately

In this case, they're either 48 or 49 along that axis.

We can do the same thing to each new array for the second dimension:
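The 48/49 pattern follows from how numpy.array_split distributes a remainder: for a length l split into n sections, the first l % n sections get l // n + 1 elements and the rest get l // n. A minimal sketch of that arithmetic (the variable names here are illustrative and not from the answer):

```python
import math
import numpy

length, max_chunk = 1009, 50
num_sections = math.ceil(length / max_chunk)

# array_split gives the first (length % num_sections) sections one extra element
base, extra = divmod(length, num_sections)

# Check against what array_split actually produces on a 1-D array
sizes = [len(part) for part in numpy.array_split(numpy.arange(length), num_sections)]
```

Here one section has 49 elements and the other twenty have 48, so every section stays at or below the requested maximum of 50.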

num_sections = math.ceil(a.shape[1] / chunk_shape[1])
second_split = [numpy.array_split(a2, num_sections, axis=1) for a2 in first_split]

This gives us a list of lists. Each sublist contains numpy arrays of the size we wanted:

print(len(second_split))
# 21
print({len(i) for i in second_split})
# {21}
# All sublists are 21 long
print({i2.shape for i in second_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes

The full function

We can implement this for arbitrary dimensions using a recursive function:

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    if start_axis == len(a.shape):
        return a

    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

We call it like so:

full_split = split_to_approx_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes
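To connect this back to the question's goal of running an operation per chunk and aggregating afterwards, here is a rough sketch. The summation is only a stand-in for whatever operation raises the MemoryError; the names chunk_sums and total are illustrative.

```python
import math
import numpy

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')
    if start_axis == len(a.shape):
        return a
    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(s, chunk_shape, start_axis + 1) for s in split]

a = numpy.random.choice([1, 2, 3, 4], (1009, 1009))
full_split = split_to_approx_shape(a, (50, 50))

# Run the (placeholder) operation on each chunk, then combine the partial results.
chunk_sums = [int(chunk.sum()) for row in full_split for chunk in row]
total = sum(chunk_sums)
```

For a 2-D input the result is a list of rows, each a list of chunks, so two nested loops reach every chunk.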

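Exact shapes plus remainder

The logic

If we want to be a little fancier and have every new array be exactly the specified size, except for a trailing leftover array, we can do that by passing array_split a list of indices to split at.

First build up the list of indices: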

axis = 0
split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]

This gives us a list of indices, each 50 greater than the previous one:

print(split_indices)
# [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]

Then split:

first_split = numpy.array_split(a, split_indices, axis=0)
print(len(first_split))
# 21
print({i.shape for i in first_split})
# {(9, 1009), (50, 1009)}
# Distinct shapes, so we don't see all 21 separately
print((first_split[0].shape, first_split[1].shape, '...', first_split[-2].shape, first_split[-1].shape))
# ((50, 1009), (50, 1009), '...', (50, 1009), (9, 1009))

And then again for the second axis:

axis = 1
split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]
second_split = [numpy.array_split(a2, split_indices, axis=1) for a2 in first_split]
print({i2.shape for i in second_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}

The full function

Adapting the recursive function:

def split_to_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    if start_axis == len(a.shape):
        return a

    split_indices = [
        chunk_shape[start_axis]*(i+1)
        for i in range(math.floor(a.shape[start_axis] / chunk_shape[start_axis]))
    ]
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

And we call it exactly the same way:

full_split = split_to_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}
# Distinct shapes
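One edge case that may be worth checking: when an axis length is an exact multiple of the chunk size, the index list built with math.floor ends with the axis length itself, and numpy.array_split then returns an empty trailing array. The sketch below shows this on a small 1-D array. The range-based alternative is an assumption on my part, not the answer's code; it yields the same indices for the 1009 example.

```python
import math
import numpy

a = numpy.arange(100)
chunk = 50

# Index list built the same way as in the walkthrough: [50, 100]
floor_indices = [chunk * (i + 1) for i in range(math.floor(a.shape[0] / chunk))]
with_floor = numpy.array_split(a, floor_indices)
floor_sizes = [len(p) for p in with_floor]   # the last piece is empty

# Alternative (assumption): stop before the end of the axis, giving [50]
range_indices = list(range(chunk, a.shape[0], chunk))
with_range = numpy.array_split(a, range_indices)
range_sizes = [len(p) for p in with_range]
```

If empty pieces would upset the downstream operation, either filter them out or build the indices with range as above.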

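Extra notes

Performance

These functions seem to be quite fast. I was able to split a test array of about the same size as my data set (roughly 14 billion elements) into 1000 by 1000 pieces (resulting in roughly 14,000 new arrays) in under 0.05 seconds with either function: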

import timeit

print('Building test array')
a = numpy.random.randint(4, size=(55000, 250000), dtype='uint8')
chunks = (1000, 1000)
numtests = 1000
print('Running {} tests'.format(numtests))
print('split_to_approx_shape: {} seconds'.format(timeit.timeit(lambda: split_to_approx_shape(a, chunks), number=numtests) / numtests))
print('split_to_shape: {} seconds'.format(timeit.timeit(lambda: split_to_shape(a, chunks), number=numtests) / numtests))

Output:

Building test array
Running 1000 tests
split_to_approx_shape: 0.035109398348040485 seconds
split_to_shape: 0.03113800323300747 seconds

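I did not test speed with higher-dimensional arrays.

Shapes smaller than the max

Both functions work properly if the size of any dimension is less than the specified maximum. This requires no special logic.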
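A likely reason for these timings is that numpy.array_split slices the input and returns views rather than copies, so no element data is moved during the split. A small check, using a much smaller array than the benchmark:

```python
import numpy

a = numpy.zeros((1009, 1009), dtype='uint8')
pieces = numpy.array_split(a, 21, axis=0)

# True when the piece is backed by the same buffer as the original array
shares = numpy.shares_memory(pieces[0], a)

# Writing through the view changes the original array
pieces[0][0, 0] = 7
```

This also means the split itself does not reduce memory use; any saving would come from the per-chunk operation allocating chunk-sized temporaries instead of full-sized ones.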
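As a quick illustration of that claim, the sketch below runs both functions on a small array whose first axis (30) is shorter than the 50 maximum. The function bodies are condensed from the ones above; the array shape is arbitrary.

```python
import math
import numpy

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if start_axis == len(a.shape):
        return a
    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(s, chunk_shape, start_axis + 1) for s in split]

def split_to_shape(a, chunk_shape, start_axis=0):
    if start_axis == len(a.shape):
        return a
    size, chunk = a.shape[start_axis], chunk_shape[start_axis]
    split_indices = [chunk * (i + 1) for i in range(math.floor(size / chunk))]
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(s, chunk_shape, start_axis + 1) for s in split]

small = numpy.zeros((30, 120))

approx_shapes = [[c.shape for c in row] for row in split_to_approx_shape(small, (50, 50))]
exact_shapes = [[c.shape for c in row] for row in split_to_shape(small, (50, 50))]
```

The short axis is left as a single section in both cases; only the 120-long axis is divided.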

【Discussion】:

Great answer! I'd just point out that np.block(full_split) reassembles the original array.
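A short check of that comment, as a sketch: the nested list below is built directly with numpy.array_split and has the same rows-of-chunks layout as full_split for a 2-D input.

```python
import numpy

a = numpy.random.choice([1, 2, 3, 4], (1009, 1009))

# Same nested layout as full_split: a list of rows, each a list of chunks
rows = numpy.array_split(a, 21, axis=0)
grid = [numpy.array_split(row, 21, axis=1) for row in rows]

# numpy.block joins the innermost lists along the last axis,
# then the outer list along the first axis
rebuilt = numpy.block(grid)
```

Note that numpy.block builds a new full-size array, so it is only practical when the reassembled result fits in memory.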

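【Solution 2】:

Since I don't know how your data is generated or how it will be processed, I can suggest two approaches:

Enable use of numpy's reshape

Pad the array so that it can be reshaped to your chunk dimensions. Simply pad with zeros so that (axis_size % chunk_size) == 0 for each axis. The chunk_size may differ per axis.

Padding a multi-dimensional array like this creates a (slightly larger) copy. To avoid the copy, "cut out" the largest chunk-able sub-array, reshape it, and process the leftover borders separately.

Depending on how your data needs to be processed, this might be very impractical.

Use a list comprehension

I think there are simpler and more readable versions of the split implementation, using either numpy.split() or just slicing.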
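A minimal sketch of the padding approach, assuming zero padding at the end of each axis and a small 7 by 10 array chosen only for illustration (the names pad_widths, padded and blocks are not from the answer):

```python
import numpy as np

a = np.arange(7 * 10).reshape(7, 10)
chunk_shape = (3, 4)

# Zero-pad the end of each axis up to the next multiple of the chunk size.
pad_widths = [(0, -size % chunk) for size, chunk in zip(a.shape, chunk_shape)]
padded = np.pad(a, pad_widths, mode='constant', constant_values=0)

# (rows, cols) -> (row_blocks, chunk_rows, col_blocks, chunk_cols),
# then move the two block axes to the front.
n0 = padded.shape[0] // chunk_shape[0]
n1 = padded.shape[1] // chunk_shape[1]
blocks = padded.reshape(n0, chunk_shape[0], n1, chunk_shape[1]).swapaxes(1, 2)
```

blocks[i, j] is then the chunk in block-row i and block-column j. The padded zeros end up in the last row and column of blocks, which matters if the per-chunk operation is sensitive to them.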

import numpy as np

a = np.arange(1009)

chunk_size = 50

%timeit np.split(a, range(chunk_size, a.shape[0], chunk_size))
%timeit [a[i:i+chunk_size] for i in range(0, a.shape[0], chunk_size)]

This shows that the list comprehension is about 3 times faster while returning the same result:

36.8 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10.4 µs ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

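I guess the speed-up of the list comprehension should carry over directly to higher-dimensional arrays. numpy's implementation of array_split basically does the same thing, but additionally allows chunking along arbitrary axes. However, the list comprehension could be extended to do that as well.

【Discussion】:

【Solution 3】:

By simply using np.array_split and ceiling division, we can do this relatively easily.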
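One possible way to extend the list comprehension to two axes, as a sketch using the question's 1009 by 1009 size and a 50 by 50 maximum:

```python
import numpy as np

a = np.arange(1009 * 1009).reshape(1009, 1009)
rows, cols = 50, 50

# Outer loop walks down axis 0, inner loop walks across axis 1.
# Slices past the end of an axis are clipped, so edge chunks are simply smaller.
chunks = [
    [a[i:i + rows, j:j + cols] for j in range(0, a.shape[1], cols)]
    for i in range(0, a.shape[0], rows)
]

shapes = {chunk.shape for row in chunks for chunk in row}
```

This produces exact 50 by 50 chunks plus remainders of width 9 along each edge, and the slices are views, so no data is copied.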

import numpy as np
import math

max_size = 15
test = np.arange(101)

result = np.array_split(test, (len(test) + (max_size -1) ) // max_size)

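【Discussion】:

This approach is wrong because it breaks when len(test) and max size are close. For example, with len(test) = 11 and max size = 10, floor division gives 1 section instead of 2. This can be fixed by using result = np.array_split(test, (len(test) + (max_size -1) ) // max_size), which performs ceiling division rather than floor division.
I may have been confused about what "split" means: if you split once, you get 2 parts. However, array_split behaves differently from my intuition. I changed it to ceiling instead of floor, thanks!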
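To make the floor versus ceiling point concrete, here is a small check. The helper name split_max is hypothetical and simply wraps the expression from the answer.

```python
import numpy as np

def split_max(arr, max_size):
    """Split a 1-D array into pieces of at most max_size elements (hypothetical helper)."""
    num_sections = (len(arr) + max_size - 1) // max_size  # ceiling division
    return np.array_split(arr, num_sections)

# The answer's example: 101 elements, at most 15 per piece
sizes_101 = [len(p) for p in split_max(np.arange(101), 15)]

# The commenter's example: 11 elements, at most 10 per piece
sizes_11 = [len(p) for p in split_max(np.arange(11), 10)]
```

With floor division the second case would request a single section of 11 elements, which exceeds the maximum; ceiling division requests 2 sections instead.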
