将 2D numpy 数组聚类成更小的 4D numpy 数组答案

【问题标题】：Clustering 2D numpy arrays into smaller 4D numpy arrays将 2D numpy 数组聚类成更小的 4D numpy 数组
【发布时间】：2025-12-24 16:30:12
【问题描述】：

尊敬的 * 用户，

我的 python 脚本面临性能问题，因为我必须处理包含超过 10 亿个元素的二维表，以获取数百个输入文件的潜在列表。我一直在用 numpy 数组操作调用替换我的嵌套 for 循环，在这个过程中，我发现 numpy.take （根据一组索引查找元素）和 numpy.outer （评估两个一维数组元素之间的所有可能乘积）非常有用。这些函数让我可以在可以使用它们的地方将代码的性能提高数百倍。

但是在我的代码中仍有一个地方存在问题，这个地方是我将包含十亿个元素的二维数组聚集成一个元素少得多（比如几千个）的 4D 数组。具体来说，我有两个索引列表，其大小等于矩阵的行数（这是一个方阵）。

第一个索引列表是 th_t，第二个列表是 dm_t，矩阵是 p_contact。聚集元素的 4D 数组被命名为 rc_p。聚类过程如下嵌套for循环：

import numpy as np

th_t = [1, 3, 2, 1, 1, 3, 3, 0, 1, 0, 2, 1]
dm_t = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
n_th = len(set(th_t))
n_dm = len(set(dm_t))
p_contact = [[0.0129, 0.0134, 0.0062, 0.0021, 0.0107, 0.0106, 0.0076, 0.0134, 0.0087, 0.0031, 0.0026, 0.0114]
[0.0123, 0.0021, 0.0033, 0.0120, 0.0099, 0.0125, 0.0001, 0.0018, 0.0030, 0.0059, 0.0038, 0.0125]
[0.0082, 0.0125, 0.0004, 0.0120, 0.0040, 0.0108, 0.0101, 0.0063, 0.0072, 0.0098, 0.0017, 0.0121]
[0.0096, 0.0008, 0.0073, 0.0100, 0.0123, 0.0104, 0.0077, 0.0025, 0.0106, 0.0126, 0.0031, 0.0033]
[0.0112, 0.0091, 0.0134, 0.0002, 0.0129, 0.0081, 0.0087, 0.0036, 0.0102, 0.0002, 0.0019, 0.0131]
[0.0099, 0.0081, 0.0037, 0.0004, 0.0135, 0.0005, 0.0025, 0.0086, 0.0091, 0.0016, 0.0130, 0.0011]
[0.0078, 0.0005, 0.0044, 0.0089, 0.0127, 0.0106, 0.0113, 0.0048, 0.0057, 0.0133, 0.0077, 0.0033]
[0.0017, 0.0010, 0.0048, 0.0052, 0.0113, 0.0066, 0.0133, 0.0092, 0.0020, 0.0125, 0.0011, 0.0023]
[0.0027, 0.0124, 0.0096, 0.0047, 0.0134, 0.0020, 0.0129, 0.0114, 0.0087, 0.0114, 0.0090, 0.0001]
[0.0032, 0.0014, 0.0038, 0.0114, 0.0058, 0.0017, 0.0089, 0.0057, 0.0022, 0.0056, 0.0046, 0.0094]
[0.0033, 0.0020, 0.0042, 0.0040, 0.0110, 0.0016, 0.0100, 0.0014, 0.0087, 0.0123, 0.0004, 0.0031]
[0.0010, 0.0029, 0.0054, 0.0015, 0.0064, 0.0060, 0.0131, 0.0064, 0.0073, 0.0097, 0.0132, 0.0092]]
n_sg = len(p_contact)
rc_p = np.zeros((n_th, n_dm, n_th, n_dm)) 
for i in range(n_sg): #n_sg can be about 40000
        for j in range(n_sg):
            rc_p[th_t[i]][dm_t[i]][th_t[j]][dm_t[j]] += p_contact[i][j]

我尝试使用各种 numpy 函数来避免嵌套十亿个元素的 for 循环，最后我得到了以下过程：

import numpy as np

th_t = [1, 3, 2, 1, 1, 3, 3, 0, 1, 0, 2, 1]
dm_t = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
n_th = len(set(th_t))
n_dm = len(set(dm_t))
p_contact = [[0.0129, 0.0134, 0.0062, 0.0021, 0.0107, 0.0106, 0.0076, 0.0134, 0.0087, 0.0031, 0.0026, 0.0114]
[0.0123, 0.0021, 0.0033, 0.0120, 0.0099, 0.0125, 0.0001, 0.0018, 0.0030, 0.0059, 0.0038, 0.0125]
[0.0082, 0.0125, 0.0004, 0.0120, 0.0040, 0.0108, 0.0101, 0.0063, 0.0072, 0.0098, 0.0017, 0.0121]
[0.0096, 0.0008, 0.0073, 0.0100, 0.0123, 0.0104, 0.0077, 0.0025, 0.0106, 0.0126, 0.0031, 0.0033]
[0.0112, 0.0091, 0.0134, 0.0002, 0.0129, 0.0081, 0.0087, 0.0036, 0.0102, 0.0002, 0.0019, 0.0131]
[0.0099, 0.0081, 0.0037, 0.0004, 0.0135, 0.0005, 0.0025, 0.0086, 0.0091, 0.0016, 0.0130, 0.0011]
[0.0078, 0.0005, 0.0044, 0.0089, 0.0127, 0.0106, 0.0113, 0.0048, 0.0057, 0.0133, 0.0077, 0.0033]
[0.0017, 0.0010, 0.0048, 0.0052, 0.0113, 0.0066, 0.0133, 0.0092, 0.0020, 0.0125, 0.0011, 0.0023]
[0.0027, 0.0124, 0.0096, 0.0047, 0.0134, 0.0020, 0.0129, 0.0114, 0.0087, 0.0114, 0.0090, 0.0001]
[0.0032, 0.0014, 0.0038, 0.0114, 0.0058, 0.0017, 0.0089, 0.0057, 0.0022, 0.0056, 0.0046, 0.0094]
[0.0033, 0.0020, 0.0042, 0.0040, 0.0110, 0.0016, 0.0100, 0.0014, 0.0087, 0.0123, 0.0004, 0.0031]
[0.0010, 0.0029, 0.0054, 0.0015, 0.0064, 0.0060, 0.0131, 0.0064, 0.0073, 0.0097, 0.0132, 0.0092]]

#prepare the flattened list of index pairs
th_t        = np.asarray(th_t)
dm_t        = np.asarray(dm_t)
thdm_stack  = np.stack((th_t, dm_t))
thdm_stack  = np.transpose(thdm_stack)
thdm_table  = np.asarray(list(product(thdm_stack, thdm_stack)))
p_contact_f = p_contact.flatten()

#calculate clustered probabilities for each contact type
rc_p                 = np.zeros((n_th, n_dm, n_th, n_dm)) 

for th1 in range(n_th):
    for dm1 in range(n_dm):
        for th2 in range(n_th):
            for dm2 in range(n_dm):
                to_find                       = np.zeros((2, 2))
                to_find[0][0]                 = th1
                to_find[0][1]                 = dm1
                to_find[1][0]                 = th2
                to_find[1][1]                 = dm2
                condition                     = np.isin(thdm_table, to_find)
                condition                     = np.all(condition, axis=(1, 2))
                to_add                        = np.extract(condition, p_contact_f)
                rc_p[th1][dm1][th2][dm2]      = np.sum(to_add)

这最终比原始过程慢，而不是更快，可能是因为我必须生成一个 10 亿大小的布尔矩阵，并在 4D for 循环的数千个步骤中处理它（它的元素比最初的 for 循环，只是为了提醒）。

那么，你们中是否有人知道如何替换这个昂贵的嵌套 for 循环并最大限度地利用 numpy 的底层 C 代码将这个大的 2D 矩阵聚集到更小的 4D 数组中？

请注意，这些数组中的各个元素是概率。 2D 数组和 4D 集群数组中所有元素的总和为 1，并且“集群”是指将概率分组为类型（显示相同索引集的 2D 矩阵的所有单元格将它们的概率加到一个4D 聚集数组元素）。

一切顺利！

【问题讨论】：

什么是i_th 和i_dm？
你能加一个minimal reproducible example吗？基本上我们需要一些测试数组和一个预期的输出，即使它比你的实际用例小得多，这样我们就知道输出应该是什么。
最后，您需要numpy 唯一的解决方案，还是我们也可以使用scipy？
好的，感谢您的评论。我可以接受 scipy 解决方案。我唯一想做的就是加速这段代码，这要归功于将 python 代码投影到任何库提供的底层 C 代码上的可能性。我将立即使示例最小化且可验证。
你绑定到python 2.7了吗？你看过numba吗？

标签： arrays python-2.7 numpy

【解决方案1】：

您实际上并没有迭代四个维度，而是迭代了 2 个维度：i 和 j。您可以将np.ravel_multi_index 一起使用th_t 和dm_t 数组以将问题减少到2d，并在最后将reshape 减少到4d：

idx = np.ravel_multi_index((th_t, dm_t), (n_th, n_dm))
rc_p = np.zeros((n_th * n_dm, n_th * n_dm))
for i in range (idx.size):
    np.add.at(rc_p[idx[i]], idx, p_contact[i])
rc_p = rc_p.reshape(n_th, n_dm, n_th, n_dm)

或者，如果您可以使用 numba，只需将您的初始循环代码包装在 @jit 中，这将对其进行 c 编译

from numba import jit

@jit
def foo(p_contact, th_t, dm_t, n_th, n_dm):
    n_sg = len(p_contact)
    rc_p = np.zeros((n_th, n_dm, n_th, n_dm)) 
    for i in range(n_sg): 
        for j in range(n_sg):
            rc_p[th_t[i]][dm_t[i]][th_t[j]][dm_t[j]] += p_contact[i][j]

【讨论】：

谢谢！我使用了第一个 sn-p 代码，当我将 rc_p.reshape(n_th, n_dm, n_th, n_dm) 替换为 rc_p = rc_p.reshape(n_th, n_dm, n_th, n_dm) 在我的脚本中。

【解决方案2】：

我想强调一下 Daniel F 提出的非常有用的功能，我不知道它是解决此问题的关键：

numpy.ravel_multi_index

它可以将索引序列转换为一维索引列表。例如，对于基于 2 和 9 索引的两个列表的索引对，这个 numpy 函数输出的 1,4 索引是第 14 个索引（9 个索引加 5）。理解起来有点棘手，但是非常强大。

【讨论】：