Finding duplicate matrices in Python? Answers

【Question title】: Finding duplicate matrices in Python?
【Posted】: 2016-05-02 03:29:27
【Question】:

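I have a matrix with a.shape: (80000, 38, 38). I want to check whether there are any duplicate or similar (38,38) matrices along the first dimension (in this case, there are 80000 of these matrices).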

I could run two for loops:

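import numpy as np

matches = []
for i in range(a.shape[0]):
    for g in range(i + 1, a.shape[0]):  # skip self-comparisons and mirrored pairs
        # a bare array comparison in `if` is ambiguous, so reduce it with np.all
        if np.all(np.abs(a[i,:,:] - a[g,:,:]) < tolerance):
            matches.append((i, g))  # save the index pair here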

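But that seems incredibly inefficient. I know there is numpy.unique, but I'm not sure I understand how it works when you have a set of 2-dimensional matrices.

Suggestions for an efficient approach? Is there a way to use broadcasting to find the differences between all elements in all the matrices?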

Tags: python numpy matrix duplicates vectorization

【Solution 1】:

Detecting exact duplicate blocks

Here is an approach using lex-sorting:

import numpy as np

# Reshape a to 2D (one flattened block per row); reused in a few places below
ar = a.reshape(a.shape[0],-1)

# Get lex-sorted indices
sortidx = np.lexsort(ar.T)

# Lex-sorting the reshaped array brings duplicate rows next to each other.
# Differentiate consecutive sorted rows: a row with at least one non-zero
# difference from its predecessor is a new unique row, i.e. a unique block
# along axes (1,2) of the original 3D array. The first sorted row is always kept.
out = a[sortidx][np.append(True,(np.diff(ar[sortidx],axis=0)!=0).any(1))]
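
The snippet above returns the unique blocks, in sorted order. If the goal is instead to know which positions along the first axis hold repeats, the same sort can be reused. The following is only a sketch under that assumption; the helper name `duplicate_indices` is made up for illustration and is not part of the original answer.

```python
import numpy as np

def duplicate_indices(a):
    """Return original first-axis indices of blocks that repeat another block.

    Hypothetical helper: within each group of identical blocks, every member
    except the one that comes first in lex-sorted order is reported.
    """
    ar = a.reshape(a.shape[0], -1)
    sortidx = np.lexsort(ar.T)
    sorted_rows = ar[sortidx]
    # True where a sorted row is identical to the row just before it
    same_as_prev = (np.diff(sorted_rows, axis=0) == 0).all(axis=1)
    # Map the flagged sorted positions back to positions in the input
    return np.sort(sortidx[1:][same_as_prev])

# Sample input from this answer: blocks 0 and 2 match, and blocks 1 and 4 match
a = np.array([[[12, 4], [0, 1]],
              [[2, 4], [3, 2]],
              [[12, 4], [0, 1]],
              [[3, 4], [1, 3]],
              [[2, 4], [3, 2]],
              [[3, 0], [2, 1]]])
dups = duplicate_indices(a)
```

Because `np.lexsort` is documented as a stable sort, the earliest occurrence of each block should be the one that is kept, so only the later copies are reported.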

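Here is another approach that treats each block of elements along axes (1,2) as an indexing tuple, collapses it to a single linear index, and uses that to find the blocks that are unique among the others: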

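import numpy as np

# Reshape a to 2D (one flattened block per row); reused in a few places below
ar = a.reshape(a.shape[0],-1)

# Get dimension shape considering each block in axes (1,2) as an indexing tuple.
# This assumes non-negative integer entries, and the product of (max+1) over
# all columns must fit in the integer dtype; otherwise the linear index
# overflows (a real risk for long rows such as 38*38 = 1444 columns).
dims = np.append(1,(ar[:,:-1].max(0)+1).cumprod())

# Finally, get the first-occurrence indices of the unique linear indices.
# These are indices along the first axis of the input array, so indexing
# with them gives the desired output of unique blocks along axes (1,2).
out = a[np.unique(ar.dot(dims),return_index=True)[1]]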
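
To make the indexing-tuple idea concrete, here is a small worked sketch on the sample array used in this answer. It prints the per-column weights and the resulting linear indices, and adds a range check done with Python integers. The names `codes`, `radix` and `fits_int64` are illustrative assumptions, not part of the original answer.

```python
import numpy as np

a = np.array([[[12, 4], [0, 1]],
              [[2, 4], [3, 2]],
              [[12, 4], [0, 1]],
              [[3, 4], [1, 3]],
              [[2, 4], [3, 2]],
              [[3, 0], [2, 1]]])
ar = a.reshape(a.shape[0], -1)

# Column maxima of the first three columns are 12, 4, 3, so the mixed-radix
# weights are 1, 13, 13*5, 13*5*4
dims = np.append(1, (ar[:, :-1].max(0) + 1).cumprod())
print(dims)   # [  1  13  65 260]

# One integer per block; equal blocks get equal integers
codes = ar.dot(dims)
print(codes)  # [324 769 324 900 769 393]

# Range check with arbitrary-precision Python ints: the encoding is only
# safe when the number of representable tuples fits in the integer dtype
radix = 1
for m in ar.max(0).tolist():
    radix *= m + 1
fits_int64 = radix <= np.iinfo(np.int64).max
```

For rows with many columns the value of `radix` grows very quickly, so a check along these lines seems worthwhile before relying on the encoding.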

Sample run:

1] Input:

In [151]: a
Out[151]: 
array([[[12,  4],
        [ 0,  1]],

       [[ 2,  4],
        [ 3,  2]],

       [[12,  4],
        [ 0,  1]],

       [[ 3,  4],
        [ 1,  3]],

       [[ 2,  4],
        [ 3,  2]],

       [[ 3,  0],
        [ 2,  1]]])

2] Output:

In [152]: ar = a.reshape(a.shape[0],-1)
     ...: sortidx = np.lexsort(ar.T)
     ...: 

In [153]: a[sortidx][np.append(True,(np.diff(ar[sortidx],axis=0)!=0).any(1))]
Out[153]: 
array([[[12,  4],
        [ 0,  1]],

       [[ 3,  0],
        [ 2,  1]],

       [[ 2,  4],
        [ 3,  2]],

       [[ 3,  4],
        [ 1,  3]]])

In [154]: dims = np.append(1,(ar[:,:-1].max(0)+1).cumprod())

In [155]: a[np.unique(ar.dot(dims),return_index=True)[1]]
Out[155]: 
array([[[12,  4],
        [ 0,  1]],

       [[ 3,  0],
        [ 2,  1]],

       [[ 2,  4],
        [ 3,  2]],

       [[ 3,  4],
        [ 1,  3]]])
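
The question also asks how numpy.unique behaves with a stack of 2D matrices. A different technique from the two shown above, assuming NumPy 1.13 or later where `np.unique` accepts an `axis` argument, is to flatten each block into a row and ask for unique rows directly. This is a minimal sketch; `dup_groups` is an illustrative name.

```python
import numpy as np

a = np.array([[[12, 4], [0, 1]],
              [[2, 4], [3, 2]],
              [[12, 4], [0, 1]],
              [[3, 4], [1, 3]],
              [[2, 4], [3, 2]],
              [[3, 0], [2, 1]]])
ar = a.reshape(a.shape[0], -1)

# Unique rows, index of the first occurrence of each, the unique-row id of
# every input row, and how often each unique row occurs
uniq, first_idx, inverse, counts = np.unique(
    ar, axis=0, return_index=True, return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # guard in case a NumPy release returns a non-1-D inverse

# Unique blocks, restored to their (n, 2, 2) shape
out = uniq.reshape((-1,) + a.shape[1:])

# For every block that occurs more than once, the list of positions holding it
dup_groups = [np.flatnonzero(inverse == k).tolist()
              for k in np.flatnonzero(counts > 1)]
```

Unlike the linear-index approach, this does not need the entries to be non-negative integers, at the cost of sorting whole rows.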

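Detecting similar blocks

For a similarity criterion, assuming you meant that the absolute values of (a[i,:,:] - a[g,:,:]) are all below tolerance, here is a vectorized approach to get the indices of all similar blocks along axes (1,2) in the input array: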

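import numpy as np
# One (R[k], C[k]) pair per k with R[k] < C[k]; a[R] holds n*(n-1)/2 blocks, so memory grows quadratically
R,C = np.triu_indices(a.shape[0],1)
mask = (np.abs(a[R] - a[C]) < tolerance).all(axis=(1,2))
I,G = R[mask], C[mask]  # index pairs of similar blocks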

Sample run:

In [267]: a
Out[267]: 
array([[[12,  4],
        [ 0,  1]],

       [[ 2,  4],
        [ 3,  2]],

       [[13,  4],
        [ 0,  1]],

       [[ 3,  4],
        [ 1,  3]],

       [[ 2,  4],
        [ 3,  2]],

       [[12,  5],
        [ 1,  1]]])

In [268]: tolerance = 2

In [269]: R,C = np.triu_indices(a.shape[0],1)
     ...: mask = (np.abs(a[R] - a[C]) < tolerance).all(axis=(1,2))
     ...: I,G = R[mask], C[mask]
     ...: 

In [270]: I
Out[270]: array([0, 0, 1, 2])

In [271]: G
Out[271]: array([2, 5, 4, 5])
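
With 80000 blocks, the all-pairs version above would need about 3.2 billion index pairs at once. One possible compromise, sketched here as an assumption rather than as part of the original answer, is to keep the outer loop and broadcast each block against all later blocks, so that only one row of the pair matrix is in memory at a time. The function name `similar_pairs` is hypothetical.

```python
import numpy as np

def similar_pairs(a, tolerance):
    """Return index arrays (I, G) with I < G for blocks that differ by less
    than `tolerance` in every element. Hypothetical helper for illustration."""
    I, G = [], []
    for i in range(a.shape[0] - 1):
        # Broadcast block i against every later block in one vectorized step
        close = (np.abs(a[i + 1:] - a[i]) < tolerance).all(axis=(1, 2))
        g = np.flatnonzero(close) + i + 1
        I.append(np.full(g.size, i, dtype=g.dtype))
        G.append(g)
    if not I:  # fewer than two blocks: nothing to compare
        empty = np.empty(0, dtype=np.intp)
        return empty, empty
    return np.concatenate(I), np.concatenate(G)

# Same sample input and tolerance as in the run above
a = np.array([[[12, 4], [0, 1]],
              [[2, 4], [3, 2]],
              [[13, 4], [0, 1]],
              [[3, 4], [1, 3]],
              [[2, 4], [3, 2]],
              [[12, 5], [1, 1]]])
I, G = similar_pairs(a, tolerance=2)
```

This still performs the same number of block comparisons, so it trades memory for a Python-level loop rather than reducing the amount of work.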
