【Question title】: Finding duplicate matrices in Python?
【Posted】: 2016-05-02 03:29:27
【Question】:

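I have an array with a.shape: (80000, 38, 38). I want to check whether there are any duplicate or similar (38, 38) matrices along the first dimension (in this case, there are 80000 of these matrices).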

I could run two for loops:

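import numpy as np

matches = []
for i in range(a.shape[0]):
    for g in range(a.shape[0]):
        # An elementwise comparison yields an array, so reduce it with .all()
        if (np.abs(a[i,:,:] - a[g,:,:]) < tolerance).all():
            matches.append((i, g))  # save the index here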

But this seems very inefficient. I know there is numpy.unique, but I'm not sure I understand how it works when you have a set of 2-dimensional matrices.

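Any suggestions for an efficient approach? Is there a way to use broadcasting to find the element-wise differences between all of the matrices?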

【Comments】:

    Tags: python numpy matrix duplicates vectorization


    【Solution 1】:

    Detecting exact duplicate blocks

    Here is an approach using lex-sorting:

    # Reshape a to a 2D as required in few places later on
    ar = a.reshape(a.shape[0],-1)
    
    # Get lex-sorted indices
    sortidx = np.lexsort(ar.T)
    
    # Lex-sort reshaped array to bring duplicate rows next to each other.
    # Perform differentiation to check for rows that have at least one non-zero
    # as those represent unique rows and as such those are unique blocks 
    # in axes(1,2) for the original 3D array 
    out = a[sortidx][np.append(True,(np.diff(ar[sortidx],axis=0)!=0).any(1))]
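    The question also asks how numpy.unique behaves on a stack of 2D matrices. As a minimal sketch, assuming a NumPy version that supports the axis argument of np.unique, each block can be flattened to a row and the rows deduplicated directly. The function name find_exact_duplicates is hypothetical and is not part of the original answer; it returns the first-occurrence index of each unique block plus the groups of indices that share identical content.

```python
import numpy as np

def find_exact_duplicates(a):
    """Hypothetical helper: locate exact duplicate blocks along axis 0."""
    # Flatten each (rows, cols) block into one row of a 2D array
    ar = a.reshape(a.shape[0], -1)
    # first_idx: index of the first occurrence of each unique row
    # inverse:   for every input row, the label of its unique row
    _, first_idx, inverse = np.unique(
        ar, axis=0, return_index=True, return_inverse=True
    )
    inverse = np.asarray(inverse).ravel()  # guard against a 2D inverse
    # Collect the indices of all blocks that share a label
    groups = {}
    for idx, label in enumerate(inverse):
        groups.setdefault(int(label), []).append(idx)
    duplicates = [g for g in groups.values() if len(g) > 1]
    return np.sort(first_idx), duplicates

# Usage with the sample input from this answer
a = np.array([[[12, 4], [0, 1]],
              [[2, 4], [3, 2]],
              [[12, 4], [0, 1]],
              [[3, 4], [1, 3]],
              [[2, 4], [3, 2]],
              [[3, 0], [2, 1]]])
first_idx, duplicates = find_exact_duplicates(a)
print(first_idx)    # [0 1 3 5]
print(duplicates)   # [[0, 2], [1, 4]]
```

    Unlike the lex-sort one-liner, this keeps track of which indices are duplicates of each other rather than only returning the unique blocks.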
    

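    Here is another approach that treats each block of elements along axes=(1,2) as an indexing tuple, converts it to a single linear index, and uses those indices to determine uniqueness among the blocks. Note that this assumes non-negative integer entries, and that the cumulative product of the per-column ranges must fit in the integer dtype; with 38 x 38 = 1444 elements per block it will overflow unless almost all columns are constant zero: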

    # Reshape a to a 2D as required in few places later on
    ar = a.reshape(a.shape[0],-1)
    
    # Get dimension shape considering each block in axes(1,2) as an indexing tuple
    dims = np.append(1,(ar[:,:-1].max(0)+1).cumprod())
    
    # Finally get unique indexing tuples' indices that represent unique
    # indices along first axis for indexing into input array and thus get 
    # the desired output of unique blocks along the axes(1,2)
    out = a[np.unique(ar.dot(dims),return_index=True)[1]]
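    Because the linear-index trick depends on non-negative integers and a product that does not overflow, here is a sketch of a different technique for the same exact-duplicate criterion: hashing the raw bytes of each block in a dictionary. This is not from the original answer, and the function name duplicate_groups_by_bytes is hypothetical. It compares blocks byte for byte, so for floating-point data it assumes that bitwise equality is the intended notion of "duplicate" (for example, 0.0 and -0.0 would count as different).

```python
import numpy as np
from collections import defaultdict

def duplicate_groups_by_bytes(a):
    """Hypothetical helper: group indices of byte-identical blocks."""
    # One contiguous row per block so that tobytes() is well defined
    ar = np.ascontiguousarray(a).reshape(a.shape[0], -1)
    buckets = defaultdict(list)
    for i, row in enumerate(ar):
        # Identical blocks produce identical byte strings
        buckets[row.tobytes()].append(i)
    # Keep only the groups that contain more than one block
    return [idx for idx in buckets.values() if len(idx) > 1]

# Usage with the sample input from this answer
a = np.array([[[12, 4], [0, 1]],
              [[2, 4], [3, 2]],
              [[12, 4], [0, 1]],
              [[3, 4], [1, 3]],
              [[2, 4], [3, 2]],
              [[3, 0], [2, 1]]])
print(duplicate_groups_by_bytes(a))  # [[0, 2], [1, 4]]
```

    This makes a single pass over the blocks and does not require sorting, at the cost of a Python-level loop.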
    

    Sample run:

    1] Input:

    In [151]: a
    Out[151]: 
    array([[[12,  4],
            [ 0,  1]],
    
           [[ 2,  4],
            [ 3,  2]],
    
           [[12,  4],
            [ 0,  1]],
    
           [[ 3,  4],
            [ 1,  3]],
    
           [[ 2,  4],
            [ 3,  2]],
    
           [[ 3,  0],
            [ 2,  1]]])
    

    2] Output:

    In [152]: ar = a.reshape(a.shape[0],-1)
         ...: sortidx = np.lexsort(ar.T)
         ...: 
    
    In [153]: a[sortidx][np.append(True,(np.diff(ar[sortidx],axis=0)!=0).any(1))]
    Out[153]: 
    array([[[12,  4],
            [ 0,  1]],
    
           [[ 3,  0],
            [ 2,  1]],
    
           [[ 2,  4],
            [ 3,  2]],
    
           [[ 3,  4],
            [ 1,  3]]])
    
    In [154]: dims = np.append(1,(ar[:,:-1].max(0)+1).cumprod())
    
    In [155]: a[np.unique(ar.dot(dims),return_index=True)[1]]
    Out[155]: 
    array([[[12,  4],
            [ 0,  1]],
    
           [[ 3,  0],
            [ 2,  1]],
    
           [[ 2,  4],
            [ 3,  2]],
    
           [[ 3,  4],
            [ 1,  3]]])
    

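    Detecting similar blocks

    For the similarity criterion, assuming you mean that every element of the absolute difference np.abs(a[i,:,:] - a[g,:,:]) is < tolerance, here is a vectorized approach that gets the indices of all pairs of similar blocks along axes(1,2) in the input array: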

    R,C = np.triu_indices(a.shape[0],1)
    mask = (np.abs(a[R] - a[C]) < tolerance).all(axis=(1,2))
    I,G = R[mask], C[mask]
    

    Sample run:

    In [267]: a
    Out[267]: 
    array([[[12,  4],
            [ 0,  1]],
    
           [[ 2,  4],
            [ 3,  2]],
    
           [[13,  4],
            [ 0,  1]],
    
           [[ 3,  4],
            [ 1,  3]],
    
           [[ 2,  4],
            [ 3,  2]],
    
           [[12,  5],
            [ 1,  1]]])
    
    In [268]: tolerance = 2
    
    In [269]: R,C = np.triu_indices(a.shape[0],1)
         ...: mask = (np.abs(a[R] - a[C]) < tolerance).all(axis=(1,2))
         ...: I,G = R[mask], C[mask]
         ...: 
    
    In [270]: I
    Out[270]: array([0, 0, 1, 2])
    
    In [271]: G
    Out[271]: array([2, 5, 4, 5])
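
    The vectorized version materializes a[R] and a[C] for every pair at once. For 80000 blocks that is roughly 3.2 billion pairs of 1444 elements each, which will not fit in memory. A minimal sketch of a chunked variant follows, assuming the same criterion (all absolute differences < tolerance). The function name similar_pairs_chunked and the chunk parameter are hypothetical and not part of the original answer. It bounds memory use but is still quadratic in the number of blocks.

```python
import numpy as np

def similar_pairs_chunked(a, tolerance, chunk=64):
    """Hypothetical helper: pairs (i, g), i < g, of similar blocks."""
    n = a.shape[0]
    ar = a.reshape(n, -1)  # one flattened block per row
    I_parts = [np.empty(0, dtype=np.intp)]
    G_parts = [np.empty(0, dtype=np.intp)]
    for r0 in range(0, n, chunk):
        rows = ar[r0:r0 + chunk]
        # Only visit column chunks at or after the row chunk
        for c0 in range(r0, n, chunk):
            cols = ar[c0:c0 + chunk]
            # (rows, cols, elements) difference tensor for this tile only
            diff = np.abs(rows[:, None, :] - cols[None, :, :])
            close = (diff < tolerance).all(axis=2)
            r, c = np.nonzero(close)
            r = r + r0
            c = c + c0
            keep = r < c  # drop self-pairs and mirrored pairs
            I_parts.append(r[keep])
            G_parts.append(c[keep])
    I = np.concatenate(I_parts)
    G = np.concatenate(G_parts)
    order = np.lexsort((G, I))  # sort by I, then by G
    return I[order], G[order]

# Usage with the sample input from this run
a = np.array([[[12, 4], [0, 1]],
              [[2, 4], [3, 2]],
              [[13, 4], [0, 1]],
              [[3, 4], [1, 3]],
              [[2, 4], [3, 2]],
              [[12, 5], [1, 1]]])
I, G = similar_pairs_chunked(a, tolerance=2, chunk=2)
print(I)  # [0 0 1 2]
print(G)  # [2 5 4 5]
```

    The peak temporary size is about chunk * chunk * 1444 elements per tile, so chunk can be tuned to the available memory.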
    

    【Comments】:
