Pythonic way of finding indexes of unique elements in two arrays: answers

【Question title】: Pythonic way of finding indexes of unique elements in two arrays
【Posted】: 2021-03-24 12:42:14
【Question】:

I have two sorted numpy arrays similar to these:

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])

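Elements never repeat within the same array. I want to figure out a pythonic way to get a list of index pairs marking the positions where the same element exists in both arrays.

For example, 1 exists in both x and y at index 0. Element 2 in x does not exist in y, so I don't care about that item. However, 8 does exist in both arrays: at index 2 in x but at index 1 in y. Similarly, 15 exists in both, at index 4 in x but at index 2 in y. So the result of my function would be a list, which in this case is [[0, 0], [2, 1], [4, 2]].

So far, what I am doing is: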

def get_indexes(x, y):
    indexes = []
    for i in range(len(x)):
        # Find index where item x[i] is in y:
        j = np.where(x[i] == y)[0]

        # If it exists, save it:
        if len(j) != 0:
            indexes.append([i, j[0]])

    return indexes

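But the problem is that the arrays x and y are very large (millions of items), so it takes quite a while. Is there a better pythonic way to do this?

【Question discussion】:

Does this answer your question? 'in' for two sorted lists with the lowest complexity
Hi @Tomerikoo, thanks for the pointer! I think this is different enough, because I am interested in the indexes, not just in whether the elements exist in both arrays. I think that is the extra complexity in this question?
@Tomerikoo, it seems all the answers in that link use explicit Python loops, which are slower than the several answers here that avoid them.
Well, the OP of that question actually said something about indexes, which everyone gracefully decided to ignore ^_^. It is indeed not the same. I will retract my close vote and leave the comments, since they are related.

Tags: python arrays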

【Solution 1】:

No Python loops

Code

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between two arrays
    overlap = np.intersect1d(x, y)
    
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    
    # Result is the zip two 1d numpy arrays into 2d array
    return np.dstack((loc1, loc2))[0]

Usage

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
result = get_indexes_darrylg(x, y)

# result: array([[0, 0],
#                [2, 1],
#                [4, 2]], dtype=int64)
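
As a possible shortcut, `np.intersect1d` can return the matching positions itself when called with `return_indices=True`, which avoids the two `searchsorted` calls. The sketch below is not part of the original answer; the function name `get_indexes_intersect` is made up for illustration, and `assume_unique=True` relies on the question's statement that values never repeat within an array.

```python
import numpy as np

def get_indexes_intersect(x, y):
    """Hypothetical helper: index pairs [i, j] such that x[i] == y[j]."""
    # assume_unique=True is an assumption taken from the question:
    # neither array contains a repeated value.
    _, loc1, loc2 = np.intersect1d(x, y, assume_unique=True, return_indices=True)
    # Stack the two 1d index arrays as the columns of a 2d array
    return np.column_stack((loc1, loc2))

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
print(get_indexes_intersect(x, y).tolist())  # [[0, 0], [2, 1], [4, 2]]
```

This variant was not part of the timing comparison below, so its relative speed is untested here.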

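Timing the posted solutions

The results show that darrylg's code has the fastest run time.

Code adjustments

Each posted solution is wrapped as a function.
Each is slightly modified so that it outputs a numpy array.
Curves are named after the poster.

Code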

import numpy as np
import perfplot

def create_arr(n):
    ' Creates pair of 1d numpy arrays with half the elements equal '
    max_val = 100000     # One more than largest value in output arrays
    
    arr1 = np.random.randint(0, max_val, (n,))
    arr2 = arr1.copy()
    
    # Change half the elements in arr2
    all_indexes = np.arange(0, n, dtype=int)
    indexes = np.random.choice(all_indexes, size = n//2, replace = False) # locations to make changes
    
    
    np.put(arr2, indexes, np.random.randint(0, max_val, (n//2, )))        # assign new random values at change locations
   
    arr1 = np.sort(arr1)
    arr2 = np.sort(arr2)
    
    return (arr1, arr2)

def get_indexes_lllrnr101(x,y):
    ' lllrnr101 answer '
    ans = []
    i=0
    j=0
    while (i<len(x) and j<len(y)):
        if x[i] == y[j]:
            ans.append([i,j])
            i += 1
            j += 1
        elif (x[i]<y[j]):
            i += 1
        else:
            j += 1
    return np.array(ans)

def get_indexes_joostblack(x, y):
    'joostblack'
    indexes = []
    for idx,val in enumerate(x):
        idy = np.searchsorted(y,val)
        try:
            if y[idy]==val:
                indexes.append([idx,idy])
        except IndexError:
            continue  # ignore index errors
            
    return np.array(indexes)

def get_indexes_mustafa(x, y):
    indices_in_x = np.flatnonzero(np.isin(x, y))                 # array([0, 2, 4])
    indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))   # array([0, 1, 2])
    
    return np.array(list(zip(indices_in_x, indices_in_y)))

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between two arrays
    overlap = np.intersect1d(x, y)
    
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    
    # Result is the zip two 1d numpy arrays into 2d array
    return np.dstack((loc1, loc2))[0]

def get_indexes_akopcz(x, y):
    ' akopcz answer '
    return np.array([
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ])

perfplot.show(
    setup = create_arr,  # tuple of two 1D random arrays
    kernels=[
        lambda a: get_indexes_lllrnr101(*a),
        lambda a: get_indexes_joostblack(*a),
        lambda a: get_indexes_mustafa(*a),
        lambda a: get_indexes_darrylg(*a),
        lambda a: get_indexes_akopcz(*a),
    ],
    labels=["lllrnr101", "joostblack", "mustafa", "darrylg", "akopcz"],
    n_range=[2 ** k for k in range(5, 21)],
    xlabel="Array Length",
    # More optional arguments with their default values:
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    equality_check=None, #np.allclose,  # set to None to disable "correctness" assertion
    # show_progress=True,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
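
One caveat about the test data: `create_arr` draws values with `np.random.randint`, so the generated arrays can contain repeated values, which the question rules out. A minimal sketch of an alternative setup function follows, assuming that duplicate-free input is wanted for the benchmark; `create_unique_arr` and its `seed` parameter are hypothetical names, not part of the original answer.

```python
import numpy as np

def create_unique_arr(n, seed=None):
    """Hypothetical alternative to create_arr: two sorted arrays without repeats."""
    rng = np.random.default_rng(seed)
    half = n // 2
    # Draw n + half distinct values from a range four times larger than n
    pool = rng.choice(4 * n, size=n + half, replace=False)
    arr1 = np.sort(pool[:n])
    arr2 = np.sort(pool[half:half + n])  # shares n - half values with arr1
    return arr1, arr2

a, b = create_unique_arr(32, seed=0)
print(len(a), len(b), len(np.intersect1d(a, b)))  # 32 32 16
```

It takes the array length as its first argument, so it could be passed as `setup=create_unique_arr` in the `perfplot.show` call.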

【Discussion】:

【Solution 2】:

What you are doing takes O(n·m) time, since np.where scans all of y for each element of x.
You can do this in O(n) by iterating over both arrays with two pointers: since they are sorted, advance the pointer of the array that holds the smaller element.

See below:

x = [1, 2, 8, 11, 15]
y = [1, 8, 15, 17, 20, 21]

def get_indexes(x,y):
    ans = []
    i=0
    j=0
    while (i<len(x) and j<len(y)):
        if x[i] == y[j]:
            ans.append([i,j])
            i += 1
            j += 1
        elif (x[i]<y[j]):
            i += 1
        else:
            j += 1
    return ans

print(get_indexes(x,y))

This gives me:

[[0, 0], [2, 1], [4, 2]]

【Discussion】:

【Solution 3】:

You can use numpy.searchsorted:

def get_indexes(x, y):
    indexes = []
    for idx,val in enumerate(x):
        idy = np.searchsorted(y,val)
        # searchsorted returns len(y) when val is larger than every element of y
        if idy < len(y) and y[idy]==val:
            indexes.append([idx,idy])
    return indexes
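
The same lookup can also be done without a Python loop, because `np.searchsorted` accepts a whole array of values. The following is a sketch under the assumption that `y` is sorted; the name `get_indexes_vectorized` is made up for illustration and the function was not part of the timing comparison.

```python
import numpy as np

def get_indexes_vectorized(x, y):
    """Hypothetical loop-free variant of the searchsorted approach; y must be sorted."""
    x = np.asarray(x)
    y = np.asarray(y)
    if len(y) == 0:
        return np.empty((0, 2), dtype=np.intp)
    # Insertion position in y for every element of x (may equal len(y))
    idy = np.searchsorted(y, x)
    # Clamp positions so that indexing y never goes out of bounds
    safe = np.minimum(idy, len(y) - 1)
    # True where the element at the insertion position really equals x[i]
    found = y[safe] == x
    return np.column_stack((np.flatnonzero(found), idy[found]))

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
print(get_indexes_vectorized(x, y).tolist())  # [[0, 0], [2, 1], [4, 2]]
```

Clamping the positions plays the same role as the bounds check in the loop version: a value of x larger than everything in y is compared against the last element of y and is therefore reported as not found.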

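【Discussion】:

【Solution 4】:

One solution is to first look from x's side to see which of its values are contained in y, getting their indexes through np.isin and np.flatnonzero, and then to use the same procedure from the other side; but instead of passing all of x, we pass only the (already found) intersecting elements to save time: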

indices_in_x = np.flatnonzero(np.isin(x, y))                 # array([0, 2, 4])
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))   # array([0, 1, 2])

Now you can zip them to get the result:

result = list(zip(indices_in_x, indices_in_y))               # [(0, 0), (2, 1), (4, 2)]
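
For reuse, the two steps can be wrapped in a function that returns a 2d numpy array. This is a sketch, not the answer's own code: the name `get_indexes_isin` is hypothetical, and passing `assume_unique=True` to `np.isin` relies on the question's guarantee that values never repeat within an array.

```python
import numpy as np

def get_indexes_isin(x, y):
    """Hypothetical wrapper; assumes x and y are sorted and free of duplicates."""
    # Positions in x whose values also occur in y
    in_x = np.flatnonzero(np.isin(x, y, assume_unique=True))
    # Positions in y whose values occur among the matched values of x
    in_y = np.flatnonzero(np.isin(y, x[in_x], assume_unique=True))
    # Pair the k-th match in x with the k-th match in y
    return np.column_stack((in_x, in_y))

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
print(get_indexes_isin(x, y).tolist())  # [[0, 0], [2, 1], [4, 2]]
```

Note that the pairing is purely positional: the k-th match in x lines up with the k-th match in y only because both arrays are sorted.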

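【Discussion】:

【Solution 5】:

Although this function searches the y array for all occurrences of x[i], if duplicates are not allowed in y it will find each x[i] only once.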

def get_indexes(x, y):
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]
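
To illustrate the remark about duplicates, here is a small run on made-up input in which y repeats a value (something the question rules out). The array `y_dup` is an assumption introduced only for this example.

```python
import numpy as np

def get_indexes(x, y):
    # Same comprehension as above: one [i, j] pair per occurrence of x[i] in y
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]

x = np.array([1, 2, 8])
y_dup = np.array([1, 1, 8, 9])  # made-up input: the value 1 appears twice
pairs = [[int(i), int(j)] for i, j in get_indexes(x, y_dup)]
print(pairs)  # [[0, 0], [0, 1], [2, 2]]
```

The value 1 yields two pairs because it occurs at two positions in `y_dup`; with duplicate-free input each matching element yields exactly one pair.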

【Discussion】: