【Question Title】: How to efficiently find the indices of a first array's values that match a second array's values?
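【Posted】: 2021-06-13 14:48:04
【Question】:

I have two numpy arrays, A and B. A has shape (10000000, 3) and B has shape (1000000, 3). Both arrays hold XYZ coordinates, and B corresponds to a region of A. I need to find the indices of A that correspond to the values in B. My current solution is shown below. I would appreciate help optimizing it with NumPy or another Python package.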

extract_BinA=np.empty(B.shape[0])
for i in range(B.shape[0]):
    for j in range(A.shape[0]):
        if(A[j][0]==B[i][0] and A[j][1]==B[i][1] and A[j][2]==B[i][2]):
            extract_BinA[i]=j

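【Comments】:

These are huge arrays, and as you have found, the two-for-loop approach does not scale. To develop a better approach we need more information: how many elements do you want to subsample, and are the arrays sparse?
Thanks for the quick reply. I want all of the indices corresponding to B. The arrays are dense.

Tags: python performance numpy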

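【Solution 1】:

The problem here is not the speed of pure Python code but the algorithm itself. Using sorted arrays or hash tables, you can bring the complexity down to O(n log n) or even O(n), instead of the current slow O(n^2) solution (and the one proposed by @Mazen). O(n^2) is impractical here: with the shapes given in the question it takes about 10,000,000 * 1,000,000 = 10,000 billion comparisons, far too many to finish in a reasonable time on any modern computer.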
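The sorted-array route mentioned above can be sketched in NumPy roughly as follows. This is one possible arrangement under stated assumptions, not the answer's own code: the helper name match_rows_sorted and the -1 "not found" marker are made up for illustration. Each distinct row is first mapped to an integer id with np.unique, and the ids of B are then located in the sorted ids of A with a binary search.

```python
import numpy as np

def match_rows_sorted(A, B):
    """For each row of B, return the first index of an equal row in A, or -1.

    Hypothetical helper: sorting plus binary search, O((n + m) log(n + m)).
    """
    # Give every distinct row (from A or B) one integer id.
    stacked = np.concatenate([A, B], axis=0)
    _, ids = np.unique(stacked, axis=0, return_inverse=True)
    ids = ids.ravel()  # the inverse may come back 2-D on some NumPy versions
    ids_a, ids_b = ids[:len(A)], ids[len(A):]

    # Sort the ids of A; a stable sort keeps duplicates in original order.
    order = np.argsort(ids_a, kind="stable")
    sorted_a = ids_a[order]

    # Leftmost insertion point of each id of B in the sorted ids of A.
    pos = np.searchsorted(sorted_a, ids_b)
    pos = np.minimum(pos, len(A) - 1)  # keep indices in range (A must be non-empty)
    found = sorted_a[pos] == ids_b
    return np.where(found, order[pos], -1)

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
B = np.array([[0, 0, 0], [4, 5, 6], [13, 14, 15]])
print(match_rows_sorted(A, B))
```

Everything runs inside NumPy, so no Python-level loop over the rows is needed. The price is a temporary copy of both arrays stacked together.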

Here is a hash-table solution in pure Python:

table = {tuple(A[i]):i for i in range(A.shape[0])}
extract_BinA = np.empty(B.shape[0])
for i in range(B.shape[0]):
    val = tuple(B[i])
    if val in table:
        extract_BinA[i] = table[val]

Note that the result may differ if A contains several points at the same location.

Here is a benchmark with two random arrays of size 10,000:

Initial solution: 53.82 s
Mazen solution:    1.76 s
This solution:     0.02 s

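On this small input, the code above is about 2700 times faster than the initial solution and 88 times faster than the proposed alternative. With larger inputs the gap widens, and the code above becomes many orders of magnitude faster than the other two solutions (i.e. more than 10000 times faster).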
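A comparison of this kind could be reproduced with a small timing harness like the one below. It is only a sketch: the function names, the random seed and the value range are assumptions, the original benchmark setup is not known, and the timings will differ from machine to machine. The double-loop version is left out because it takes on the order of a minute at this size.

```python
import timeit
import numpy as np

def hash_lookup(A, B):
    """Dictionary-based lookup; -1 marks rows of B that are absent from A."""
    table = {tuple(row): i for i, row in enumerate(A.tolist())}
    return np.array([table.get(tuple(row), -1) for row in B.tolist()], dtype=np.intp)

def scan_lookup(A, B):
    """One vectorized scan of A per row of B (still O(n * m) comparisons)."""
    out = np.full(len(B), -1, dtype=np.intp)
    for i, b in enumerate(B):
        hits = np.flatnonzero((A == b).all(axis=1))
        if hits.size:
            out[i] = hits[0]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)                # assumed seed
    A = rng.integers(0, 1000, size=(10_000, 3))   # assumed value range
    B = rng.integers(0, 1000, size=(10_000, 3))
    for fn in (scan_lookup, hash_lookup):
        seconds = timeit.timeit(lambda: fn(A, B), number=1)
        print(f"{fn.__name__}: {seconds:.2f} s")
```

When A contains duplicate rows the two functions may report different indices (last occurrence for the dictionary, first for the scan), so compare them on duplicate-free data.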

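Update:

If A contains several points that are equal to each other, the dictionary can be modified to store a list of indices instead of a single value. Alternatively, the dictionary can be built so that only the first index is kept (the dictionary comprehension above keeps the last one). Here are examples of both solutions: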

table = dict()
for i in range(A.shape[0]):
    key = tuple(A[i])
    if key in table:
        table[key].append(i)
    else:
        table[key] = [i]

extract_BinA = np.empty(B.shape[0])
for i in range(B.shape[0]):
    val = tuple(B[i])
    if val in table:
        # Here table[val] is a list and thus you 
        # can do whatever you want with the indices. 
        # For example you can take the first one like here, 
        # or possibly the last as you want.
        extract_BinA[i] = table[val][0]

# Select always directly the first index
table = dict()
for i in range(A.shape[0]):
    key = tuple(A[i])
    if key not in table:
        table[key] = i

extract_BinA = np.empty(B.shape[0])
for i in range(B.shape[0]):
    val = tuple(B[i])
    if val in table:
        extract_BinA[i] = table[val]

Note that these solutions are somewhat slower than the code above, but their complexity is still linear (so they remain very fast).

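【Comments】:

Your answer makes a lot of sense and I think it is better. OP, could you mark this answer as the accepted answer to your question?
With A = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12], [13,14,15]]) B = np.array([[0,0,0],[4,5,6],[13,14,15]]), your solution gives [1. 1. 4.]. What am I missing, why does 1 appear twice?
@pippo1980 One should be careful when using np.empty, as the OP does. If no value is written, the element is undefined. I hit the same problem the first time I tested the code. However, the original code behaves this way too, so I simply reproduced the same behavior. When initializing with np.ones(B.shape[0]) * -1, all three methods give the same result.
@nhm I edited the answer to provide examples covering this case.
@nhm Comparing floating-point numbers with strict equality is generally a bad idea. If your points are not exactly identical, you may need a distance-based approach. The catch is that a distance-based approach makes the problem, and therefore the solution, considerably more complex. The usual approach in your case is an octree, which can solve this in O(n log(n)). Alternatively, if you know the numbers are accurate with a (fairly low) bounded error, you can still use a hash map with rounding.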
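The "hash map with rounding" idea from the last comment could look roughly like this. It is a minimal sketch under assumptions: the helper name match_rounded and the default of 6 decimals are illustrative, and the -1 marker follows the initialization suggested in the comments. Note that rounding is not a true distance test: two points closer than the tolerance can still land on different keys when they straddle a rounding boundary.

```python
import numpy as np

def match_rounded(A, B, decimals=6):
    """Match rows of B to rows of A after rounding every coordinate.

    Hypothetical helper: returns the first matching index in A, or -1.
    """
    keys_a = np.round(A, decimals).tolist()
    keys_b = np.round(B, decimals).tolist()
    table = {}
    for i, row in enumerate(keys_a):
        table.setdefault(tuple(row), i)  # keep the first index for each key
    return np.array([table.get(tuple(row), -1) for row in keys_b], dtype=np.intp)

# 0.1 + 0.2 is not exactly 0.3 in floating point, so strict equality fails here.
A = np.array([[0.1 + 0.2, 1.0, 2.0], [3.0, 4.0, 5.0]])
B = np.array([[0.3, 1.0, 2.0], [9.0, 9.0, 9.0]])
print(match_rounded(A, B))
```

The number of decimals should be chosen from the known error bound of the data. If no such bound exists, a spatial index of the kind mentioned in the comment is the safer choice.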

【Solution 2】:

Solution

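extract_BinA = np.ones(B.shape[0]) * -1
for i, b in enumerate(B):
    # Indices of the rows of A whose x, y and z all equal those of b
    idx = np.argwhere((A == b).all(axis=1))
    if idx.size:
        extract_BinA[i] = idx[0][0]
print(extract_BinA)

Explanation

Set extract_BinA to an array of -1 values with one entry per row of B:

extract_BinA = np.ones(B.shape[0]) * -1

To get the index of the row of A that equals a given row of B, we proceed as follows:

(A == b)

compares the x, y, z of one row of B with the x, y, z of every row of A.

(A == b).all(axis=1)

keeps only the rows for which x_a==x_b, y_a==y_b and z_a==z_b are all True.

np.argwhere((A == b).all(axis=1))

returns the indices of the rows for which the condition holds. Its size is zero when no row of A matches, which is what the idx.size test checks.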
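The intermediate results of these steps can be inspected on a tiny input. The arrays below are made up for illustration. The sketch also shows why the comparison has to be reduced per row: an element-wise mask passed to np.argwhere yields (row, column) pairs, so a row that shares only one coordinate with b would otherwise show up as a hit.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b_full = np.array([4, 5, 6])     # equal to row 1 of A
b_partial = np.array([4, 0, 0])  # shares only its x value with row 1

elementwise = (A == b_partial)            # shape (3, 3), one flag per coordinate
print(np.argwhere(elementwise))           # [[1 0]]: (row, column) of the lone x match

row_match = (A == b_partial).all(axis=1)  # shape (3,), one flag per row
print(np.argwhere(row_match).size)        # 0: no row equals b_partial

print(np.argwhere((A == b_full).all(axis=1)))  # [[1]]: row 1 matches completely
```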

A complete example to test:

import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]])
B = np.array([[0,0,0],[4,5,6],[13,14,15]])
# your code
extract_BinA=np.ones(B.shape[0]) * -1
for i in range(B.shape[0]):
    for j in range(A.shape[0]):
        if (A[j] == B[i]).all():
            extract_BinA[i]=j
print(extract_BinA)
# my code
extract_BinA = np.ones(B.shape[0]) * -1
for i, b in enumerate(B):
    idx = np.argwhere((A == b).all(axis=1))
    if idx.size:
        extract_BinA[i] = idx[0][0]
print(extract_BinA)

【Comments】: