Numpy array show only unique rows: answers

【Question title】: Numpy array show only unique rows
【Posted】: 2015-11-18 17:14:49
【Question description】:

I want the rows of an array that are unique. Unlike numpy's unique function, I want to exclude every row that occurs more than once.

So the input:

[[1,1],[1,1],[1,2],[2,3],[3,4],[3,4]]

should result in the output

[[1,2],[2,3]].

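I tried counting the occurrences of each row with np.unique(array, return_counts=True) and then filtering the result with > 1. I am looking for a more efficient way to do this, and also for a way to do the same without returning the counts, since they were not implemented before numpy 1.9.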
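
For reference, the counting approach described above might look like the sketch below on a recent NumPy. The axis=0 argument is an assumption that goes beyond the question: it makes np.unique treat each row as one item, and it was only added in NumPy 1.13, so this does not address the pre-1.9 requirement.

```python
import numpy as np

# Example input from the question.
A = np.array([[1, 1], [1, 1], [1, 2], [2, 3], [3, 4], [3, 4]])

# axis=0 compares whole rows (NumPy >= 1.13);
# return_counts needs NumPy >= 1.9.
rows, counts = np.unique(A, axis=0, return_counts=True)

# Keep only the rows that occur exactly once.
once = rows[counts == 1]
print(once)
```

The result comes back in sorted row order, not in the order of the input.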

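Update: In my case the data size is always [m,2], but once the concept is established it should transfer easily to the [m,n] case. In my particular case the dataset consists of integers, but solutions need not be limited to that assumption. A typical dataset will have m ~ 10^7.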

【Question discussion】:

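What is the data size of the input array? Are the values always integers?
See the answer here to compute the row frequencies, then use boolean indexing.
I don't think it gets more efficient than that, since building a dictionary of counts is O(N). You could use collections.Counter, which should do the same thing if you don't want to use numpy.
This is almost a duplicate of stackoverflow.com/q/16970982/1461210, except that you also want to exclude every row that occurs more than once, rather than all but one of the duplicates.
Your example shows an array of shape (m, 2) whose values are small integers. Is that typical of the data? Or might the array be (m, n) with n > 2, or contain integers or floats with no a priori bound on their values?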
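
The collections.Counter suggestion from the comments could be sketched as follows. This is only an illustration of the idea, not code from the thread: the counting is O(N), but it runs in a Python-level loop, so it is likely to be slow for m ~ 10^7.

```python
from collections import Counter

import numpy as np

# Example input from the question.
A = np.array([[1, 1], [1, 1], [1, 2], [2, 3], [3, 4], [3, 4]])

# Turn each row into a hashable tuple and count occurrences.
row_tuples = [tuple(r) for r in A.tolist()]
row_counts = Counter(row_tuples)

# Boolean mask over the input rows, so the input order is preserved.
keep = np.array([row_counts[t] == 1 for t in row_tuples])
once = A[keep]
print(once)
```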

Tags: python arrays numpy unique

【Solution 1】:

Approach #1

Here is one approach using lex-sorting and np.bincount -

# Perform lex sort and get the sorted array version of the input
sorted_idx = np.lexsort(A.T)
sorted_Ar =  A[sorted_idx,:]

# Mask of start of each unique row in sorted array 
mask = np.append(True,np.any(np.diff(sorted_Ar,axis=0),1))

# Get counts of each unique row
unq_count = np.bincount(mask.cumsum()-1) 

# Compare counts to 1 and select the corresponding unique row with the mask
out = sorted_Ar[mask][np.nonzero(unq_count==1)[0]]

Note that the output does not keep the rows in the order in which they originally appear in the input array.
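
If the input order matters, one possible variant of the lex-sort idea keeps track of the original indices and sorts them at the end. The helper name unq_rows_keep_order is hypothetical and not part of the answer; this is a sketch under the same assumptions as the code above.

```python
import numpy as np

def unq_rows_keep_order(A):
    """Rows of A that occur exactly once, in input order (hypothetical helper)."""
    sorted_idx = np.lexsort(A.T)
    sorted_Ar = A[sorted_idx]
    # True where a sorted row differs from the row before it.
    differs = np.any(np.diff(sorted_Ar, axis=0) != 0, axis=1)
    new_group = np.append(True, differs)   # differs from its predecessor
    ends_group = np.append(differs, True)  # differs from its successor
    # A row occurs once when it both starts and ends its group.
    single = new_group & ends_group
    # Sort the surviving original indices to restore the input order.
    return A[np.sort(sorted_idx[single])]

A = np.array([[3, 4], [1, 1], [2, 3], [1, 1], [1, 2], [3, 4]])
print(unq_rows_keep_order(A))
```

The extra np.sort only touches the surviving indices, so its cost is small next to the lex-sort itself.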

Approach #2

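If the elements are non-negative integers, you can convert the 2D array A to a 1D array by treating each row as an indexing tuple, which should be a very efficient solution. Note that the output rows come out in the sorted order of the 1D values; sort the selected indices before indexing A if the input order must be kept. The implementation would be -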

# Convert 2D array A to a 1D array assuming each row as an indexing tuple
# (the extent of each column is max+1; using max alone lets distinct rows collide)
A_1D = A.dot(np.append((A.max(0)+1)[::-1].cumprod()[::-1][1:],1))

# Get sorting indices for the 1D array
sort_idx = A_1D.argsort()

# Mask of start of each unique row in 1D sorted array 
mask = np.append(True,np.diff(A_1D[sort_idx])!=0)

# Get the counts of each unique 1D element
counts = np.bincount(mask.cumsum()-1)

# Select the IDs with counts==1 and thus the unique rows from A
out = A[sort_idx[np.nonzero(mask)[0][counts==1]]]
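
The row-to-index idea can also be sketched with np.ravel_multi_index, which computes the same kind of row-major linear index. The helper name unq_rows_int is hypothetical. The sketch assumes an integer array, shifts each column by its minimum so that negative values are handled, uses max + 1 as the column extent so that distinct rows cannot share a code, and avoids return_counts by using np.bincount.

```python
import numpy as np

def unq_rows_int(A):
    """Rows of an integer array A that occur once, in input order (hypothetical helper)."""
    # Shift each column to start at 0 so negative values are handled too.
    shifted = A - A.min(axis=0)
    # Column extents: max + 1, so distinct rows always get distinct codes.
    dims = tuple((shifted.max(axis=0) + 1).tolist())
    # Assumption: the product of dims fits in the platform index type;
    # np.ravel_multi_index raises ValueError otherwise.
    codes = np.ravel_multi_index(shifted.T, dims)
    # Map every row to the id of its code, then count the ids.
    _, inverse = np.unique(codes, return_inverse=True)
    counts = np.bincount(inverse)
    # Boolean mask over the input rows keeps the input order.
    return A[counts[inverse] == 1]

A = np.array([[1, 1], [1, 1], [1, 2], [2, 3], [3, 4], [3, 4]])
print(unq_rows_int(A))
```

For wide arrays or large value ranges the product of the extents can exceed what an index can hold, in which case a lex-sort based approach is the safer choice.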

Runtime tests and verification

Functions -

import numpy as np

def unq_rows_v1(A):
    sorted_idx = np.lexsort(A.T)
    sorted_Ar =  A[sorted_idx,:]
    mask = np.append(True,np.any(np.diff(sorted_Ar,axis=0),1))
    unq_count = np.bincount(mask.cumsum()-1) 
    return sorted_Ar[mask][np.nonzero(unq_count==1)[0]]

def unq_rows_v2(A):
    # Column extents are max+1; using max alone lets distinct rows collide
    A_1D = A.dot(np.append((A.max(0)+1)[::-1].cumprod()[::-1][1:],1))
    sort_idx = A_1D.argsort()
    mask = np.append(True,np.diff(A_1D[sort_idx])!=0)
    return A[sort_idx[np.nonzero(mask)[0][np.bincount(mask.cumsum()-1)==1]]]

Timings and output verification -

In [272]: A = np.random.randint(20,30,(10000,5))

In [273]: unq_rows_v1(A).shape
Out[273]: (9051, 5)

In [274]: unq_rows_v2(A).shape
Out[274]: (9051, 5)

In [275]: %timeit unq_rows_v1(A)
100 loops, best of 3: 5.07 ms per loop

In [276]: %timeit unq_rows_v2(A)
1000 loops, best of 3: 1.96 ms per loop

【Discussion】:

【Solution 2】:

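The numpy_indexed package (disclaimer: I am its author) can solve this efficiently in a fully vectorized manner. I have not tested it with numpy 1.9, if that is still relevant, but perhaps you would be willing to try it and let me know. I have no reason to believe it will not work with older versions of numpy.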

import numpy as np
import numpy_indexed as npi

a = np.random.rand(10000, 3).round(2)
# Unique rows and how often each one occurs
unique, count = npi.count(a)
print(unique[count == 1])

Note that, in line with your original question, this solution is not restricted to a particular number of columns or dtype.

【Discussion】: