在numpy中没有循环的情况下获取集群中元素的坐标答案

【问题标题】：Getting the coordinates of elements in clusters without a loop in numpy在numpy中没有循环的情况下获取集群中元素的坐标
【发布时间】：2021-09-27 11:38:56
【问题描述】：

我有一个二维数组，我在其中使用 ndimage.label() 函数标记集群，如下所示：

import numpy as np
from scipy.ndimage import label

input_array = np.array([[0, 1, 1, 0],
                        [1, 1, 0, 0],
                        [0, 0, 0, 1],
                        [0, 0, 0, 1]])

labeled_array, _ = label(input_array)

# Result:
# labeled_array == [[0, 1, 1, 0],
#                   [1, 1, 0, 0],
#                   [0, 0, 0, 2],
#                   [0, 0, 0, 2]]

我可以获得元素计数、质心或标记集群的边界框。但我也想获得集群中每个元素的坐标。是这样的（数据结构不一定要这样，任何数据结构都可以）：

{
    1: [(0, 1), (0, 2), (1, 0), (1, 1)],  # Coordinates of the elements that have the label "1"
    2: [(2, 3), (3, 3)]  # Coordinates of the elements that have the label "2"
}

我可以遍历标签列表并为它们中的每一个调用np.where()，但我想知道是否有一种方法可以在没有循环的情况下执行此操作，这样会更快？

【问题讨论】：

标签： python numpy scipy cluster-analysis ndimage

【解决方案1】：

您可以制作坐标地图，对其进行排序和拆分：

# Get the indexes (coordinates) of the labeled (non-zero) elements
ind = np.argwhere(labeled_array)

# Get the labels corresponding to those indexes above
labels = labeled_array[tuple(ind.T)]

# Sort both arrays so that lower label numbers appear before higher label numbers. This is not for cosmetic reasons,
# but we will use sorted nature of these label indexes when we use the "diff" method in the next step.
sort = labels.argsort()
ind = ind[sort]
labels = labels[sort]

# Find the split points where a new label number starts in the ordered label numbers
splits = np.flatnonzero(np.diff(labels)) + 1

# Create a data structure out of the label numbers and indexes (coordinates).
# The first argument to the zip is: we take the 0th label number and the label numbers at the split points
# The second argument is the indexes (coordinates), split at split points
# so the length of both arguments to the zip function is the same
result = {k: v for k, v in zip(labels[np.r_[0, splits]],
                               np.split(ind, splits))}

【讨论】：

我接受这个解决方案，因为重量级的东西是以矢量化的方式完成的，最后有一个非常轻量级的理解。因此，它在具有数千个标签的大型数组上运行得非常快。谢谢！
根据我对代码的理解，我添加了 cmets。一个问题：我们可以用标签[0，*splits]甚至np.unique替换np.r_吗？这两个对我来说都比较容易理解，你觉得呢？
@CanolGökel。 np.r_ 是允许标量的 np.concatenate 的简写
我想你可以使用labels[[0] + list(splits)]，但我不喜欢转换为列表只是为了立即转换

【解决方案2】：

方法一：

你可以试试这个，仍然使用字典理解循环：

{k: list(zip(*np.where(labeled_array == k))) for k in range(1,3)}

输出：

{1: [(0, 1), (0, 2), (1, 0), (1, 1)], 2: [(2, 3), (3, 3)]}

方法2（慢）：

这是一种使用 pandas 的方法，可能比 Mad Physicist 的方法慢：

(pd.DataFrame(labeled_array)
  .stack() 
  .reset_index()
  .groupby(0).agg(list)[1:]
  .apply(lambda x: list(zip(*x)), axis=1)
).to_dict()

输出：

{1: [(0, 1), (0, 2), (1, 0), (1, 1)], 2: [(2, 3), (3, 3)]}

使用此数据的时间：

字典理解

每个循环 8.73 µs ± 216 ns（7 次运行的平均值 ± 标准偏差，每次 100000 次循环）

使用地图坐标，排序和拆分：

每个循环 57.3 µs ± 5.55 µs（7 次运行的平均值 ± 标准偏差，每次 10000 个循环）

熊猫

每个循环 5.16 ms ± 283 µs（平均值 ± 标准偏差，7 次运行，每次 100 个循环）

【讨论】：

感谢您的回答和基准测试。我目前正在尝试您和@Mad Physicist 的答案。对于具有数千个标签的大数组，Mad Physicist 的答案似乎工作得更快，但我还没有完成全面检查。