【问题标题】:Indices that intersect and sort two numpy arrays两个 numpy 数组相交和排序的索引
【发布时间】:2015-11-04 18:10:36
【问题描述】:

我有两个 numpy 整数数组,长度都数亿。每个数组中的值都是唯一的,并且每个最初都是未排序的。

我想要每个产生它们排序交集的索引。例如:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

那么它们的排序交集是[4, 5, 11],因此我们想要将 x 和 y 中的每一个转换为该数组的索引,因此我们希望它返回:

mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])

从那时起x[mx] == y[my] == np.intersect1d(x, y)

到目前为止,我们唯一的解决方案涉及三种不同的 argsort,因此这似乎不太可能是最优的。

每个值代表一个星系,以防让问题更有趣。

【问题讨论】:

标签: python sorting numpy optimization set-intersection


【解决方案1】:

这是一个基于intersect1d 实现的选项,它相当简单。它需要一个电话到argsort

公认的简单测试通过了。

import numpy as np


def my_intersect(x, y):
    """my_intersect(x, y) -> xm, ym
    x, y: 1-d arrays of unique values
    xm, ym: indices into x and y giving sorted intersection
    """
    # basic idea taken from numpy.lib.arraysetops.intersect1d
    aux = np.concatenate((x, y))
    sidx = aux.argsort()
    # Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]

    # quicksort is not stable, so must do some work to extract indices
    # (if stable, sidx[inidx.nonzero()]  would be for x)
    # interlace the two sets of indices, and check against lengths
    xym = np.vstack((sidx[inidx.nonzero()],
                     sidx[1:][inidx.nonzero()])).T.flatten()

    xm = xym[xym < len(x)]
    ym = xym[xym >= len(x)] - len(x)

    return xm, ym


def check_my_intersect(x, y):
    mx, my = my_intersect(x, y)
    assert (x[mx] == np.intersect1d(x, y)).all()

    # not really necessary: np.intersect1d returns a sorted list
    assert (x[mx] == sorted(x[mx])).all()
    assert (x[mx] == y[my]).all()


def random_unique_unsorted(n):
    while True:
        x = np.unique(np.random.randint(2*n, size=n))
        if len(x):
            break
    np.random.shuffle(x)
    return x


x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

check_my_intersect(x, y)


for i in range(20):
    x = random_unique_unsorted(100+i)
    y = random_unique_unsorted(200+i)
    check_my_intersect(x, y)

编辑:“注意”注释令人困惑(使用 ... 作为语音省略号,忘记它也是 Python 运算符)。

【讨论】:

    【解决方案2】:

    你也可以使用np.searchsorted,像这样-

    def searchsorted_based(x,y):
    
        # Get argsort for both x and y
        xsort_idx = x.argsort()
        ysort_idx = y.argsort()
    
        # Sort x and y and store them
        X = x[xsort_idx]
        Y = y[ysort_idx]
    
        # Find positions of Y in X and the matches by the positions that 
        # shift between 'left' and 'right' based searches. 
        # Use the matches posotions to get corresponding argsort for X.
        x1 = np.searchsorted(X,Y,'left') 
        x2 = np.searchsorted(X,Y,'right') 
        out1 = xsort_idx[x1[x2 != x1]]
    
        # Repeat for X in Y findings
        y1 = np.searchsorted(Y,X,'left') 
        y2 = np.searchsorted(Y,X,'right')
        out2 = ysort_idx[y1[y2 != y1]]
    
        return out1, out2
    

    示例运行 -

    In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
         ...: y = np.array([20, 5, 4, 9, 11, 7, 25])
         ...: 
    
    In [101]: searchsorted_based(x,y)
    Out[101]: (array([0, 3, 6]), array([2, 1, 4]))
    

    【讨论】:

      【解决方案3】:

      对于纯 numpy 解决方案,您可以执行以下操作:

      1. 使用np.unique分别获取xy中的唯一值和对应的索引:

        # sorted unique values in x and y and the indices corresponding to their first
        # occurrences, such that u_x == x[u_idx_x]
        u_x, u_idx_x = np.unique(x, return_index=True)
        u_y, u_idx_y = np.unique(y, return_index=True)
        
      2. 使用np.intersect1d 查找唯一值的交集:

        # we can assume_unique, which can be faster for large arrays
        i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
        
      3. 最后,使用np.in1d 仅选择与xy 中的唯一值对应的索引,这些索引也恰好在xy 的交集处:

        # it is also safe to assume_unique here
        i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
        i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
        

      将所有这些整合到一个函数中:

      def intersect_indices(x, y):
          u_x, u_idx_x = np.unique(x, return_index=True)
          u_y, u_idx_y = np.unique(y, return_index=True)
          i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
          i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
          i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
          return i_idx_x, i_idx_y
      

      例如:

      x = np.array([4, 1, 10, 5, 8, 13, 11])
      y = np.array([20, 5, 4, 9, 11, 7, 25])
      
      i_idx_x, i_idx_y = intersect_indices(x, y)
      
      print(i_idx_x, i_idx_y)
      # (array([0, 3, 6]), array([2, 1, 4]))
      

      速度测试:

      In [1]: k = 1000000
      
      In [2]: %%timeit x, y = np.random.randint(k, size=(2, k))
      intersect_indices(x, y)
      ....: 
      1 loops, best of 3: 597 ms per loop
      

      更新:

      我最初错过了这样一个事实,即在您的情况下,xy 都只包含唯一值。考虑到这一点,使用间接排序可能会做得更好:

      def intersect_indices_unique(x, y):
          u_idx_x = np.argsort(x)
          u_idx_y = np.argsort(y)
          i_xy = np.intersect1d(x, y, assume_unique=True)
          i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)]
          i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)]
          return i_idx_x, i_idx_y
      

      这是一个更实际的测试用例,其中xy 都包含唯一(但部分重叠)的值:

      In [1]: n, k = 10000000, 1000000
      
      In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
      intersect_indices(x, y)
      ....: 
      1 loops, best of 3: 593 ms per loop
      
      In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
      intersect_indices_unique(x, y)
      ....: 
      1 loops, best of 3: 453 ms per loop
      

      @Divakar 的解决方案在性能方面非常相似:

      In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
      searchsorted_based(x, y)
      ....: 
      1 loops, best of 3: 472 ms per loop
      

      【讨论】:

        【解决方案4】:

        也许使用 dict 的纯 Python 解决方案适合您:

        def indices_from_values(a, intersect):
            idx = {value: index for index, value in enumerate(a)}
            return np.array([idx[x] for x in intersect])
        
        intersect = np.intersect1d(x, y)
        mx = indices_from_values(x, intersect)
        my = indices_from_values(y, intersect)
        np.allclose(x[mx], y[my]) and np.allclose(x[mx], np.intersect1d(x, y))
        

        【讨论】:

          猜你喜欢
          • 1970-01-01
          • 2019-05-26
          • 2017-05-04
          • 2012-08-03
          • 1970-01-01
          • 1970-01-01
          • 1970-01-01
          • 2023-01-11
          • 1970-01-01
          相关资源
          最近更新 更多