Efficiently handling duplicates in a Python list: answers

【Question title】: Efficiently handling duplicates in a Python list
【Posted】: 2016-10-18 15:22:35
【Question description】:

I would like to compactly represent duplicates in a Python list / 1D numpy array. For example, suppose we have

 x = np.array([1, 0, 0, 3, 3, 0])

This array has several duplicated elements, which can be represented by

 group_id = np.array([0, 1, 1, 2, 2, 1])

so that all duplicates in a given cluster can be found with x[group_id==<some_id>].

The list of duplicate pairs can be computed efficiently by sorting,

s_idx = np.argsort(x)
diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]

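Each element of s_idx[diff_idx] paired with the matching element of s_idx[diff_idx+1] gives the indices of a duplicate pair in the original array. (Here these are array([1, 2, 3]) and array([2, 5, 4]).)

However, I am not sure how to efficiently compute cluster_id from this linkage information for large arrays (N > 10⁶).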
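
One possible way to go from such duplicate pairs to cluster ids is to treat each pair as an edge of a graph and label its connected components. The sketch below assumes SciPy is available and uses scipy.sparse.csgraph.connected_components; cluster_ids_from_pairs is a hypothetical helper name, not something from the original post.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def cluster_ids_from_pairs(n, left, right):
    """Label n items so that items linked by a duplicate pair share an id."""
    left = np.asarray(left)
    right = np.asarray(right)
    # Sparse adjacency matrix with one entry per duplicate pair.
    data = np.ones(len(left))
    graph = coo_matrix((data, (left, right)), shape=(n, n))
    # Treat the pairs as undirected edges: each connected component is a cluster.
    _, labels = connected_components(graph, directed=False)
    return labels


x = np.array([1, 0, 0, 3, 3, 0])
s_idx = np.argsort(x, kind="stable")
diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]
cluster_id = cluster_ids_from_pairs(len(x), s_idx[diff_idx], s_idx[diff_idx + 1])
print(cluster_id)
```

Because only the pair list is used, the same helper should also apply when the pairs come from an approximate matcher rather than from exact equality.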

Edit: as @Chris_Rands suggested, this can indeed be done with itertools.groupby,

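 import numpy as np
 import itertools

 def get_group_id(x):
     group_id = np.zeros(x.shape, dtype='int')
     # groupby yields (key, run) for each run of consecutive equal values;
     # enumerate supplies the integer id, since the key is the value itself.
     # A value that reappears later takes the id of its last run, so ids are
     # consistent per value but not necessarily contiguous.
     for i, (key, _) in enumerate(itertools.groupby(x)):
         # Each mask x == key scans the whole array.
         group_id[x == key] = i
     return group_id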

but the running time appears to grow as O(n^2), which will not scale to my use case (N > 10⁶),

  for N in [50000, 100000, 200000]:
      %time _ = get_group_id(np.random.randint(0, N, size=N))

  CPU times: total: 1.53 s
  CPU times: total: 5.83 s
  CPU times: total: 23.9 s

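I believe using the duplicate linkage information would be more efficient, since computing the duplicate pairs for N=200000 takes only 6.44 µs.

【Question comments】:

Have you looked at itertools.groupby?
How about np.bincount?
@Chris_Rands Thanks for the suggestion, I just did, see the edit above.
@Anony-Mousse I don't entirely agree with your edit. "Clustering" may indeed not be the right term here, but "handling duplicates" in the title is too general. I oversimplified my original problem, and while the answers solve this post (I will accept the first one), they do not solve my actual problem. I am using the simhash algorithm to find near duplicates in a list (elements that differ by fewer than k bytes), so I can't actually use np.unique: the algorithm returns a list of duplicate pairs (as above), and I would like to construct cluster_id from that information.
You could try to define more precisely how the desired output is determined (e.g., enumerated by first occurrence?), why you need to change the integers in the first place, and why you can't use np.unique.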

Tags: python algorithm numpy grouping graph-algorithm

【Solution 1】:

You can use numpy.unique:

In [13]: x = np.array([1, 0, 0, 3, 3, 0])

In [14]: values, cluster_id = np.unique(x, return_inverse=True)

In [15]: values
Out[15]: array([0, 1, 3])

In [16]: cluster_id
Out[16]: array([1, 0, 0, 2, 2, 0])

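(The cluster IDs are assigned in the sorted order of the unique values, not in the order in which the values first appear in the input.)

Positions of the items in cluster 0: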

In [22]: cid = 0

In [23]: values[cid]
Out[23]: 0

In [24]: (cluster_id == cid).nonzero()[0]
Out[24]: array([1, 2, 5])
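
When the positions of every cluster are needed at once, repeating the boolean mask per cluster costs one full scan of the array each time. A minimal sketch of a single-pass alternative, assuming a stable argsort combined with np.bincount (the names order, counts, and members are illustrative, not from the answer):

```python
import numpy as np

x = np.array([1, 0, 0, 3, 3, 0])
values, cluster_id = np.unique(x, return_inverse=True)

# A stable sort of the ids lists the positions cluster by cluster,
# keeping positions in increasing order inside each cluster.
order = np.argsort(cluster_id, kind="stable")
# Cluster sizes give the boundaries between consecutive clusters.
counts = np.bincount(cluster_id, minlength=len(values))
members = np.split(order, np.cumsum(counts)[:-1])

for value, idx in zip(values, members):
    print(value, idx)
# 0 [1 2 5]
# 1 [0]
# 3 [3 4]
```

Here members[cid] holds the same positions as (cluster_id == cid).nonzero()[0].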

【Discussion】:

Wow, I had missed the return_inverse=True argument of np.unique; this works well and scales well, thanks!

【Solution 2】:

Here is an approach using np.unique that numbers the groups in order of each value's first occurrence -

unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
out = first_idx.argsort().argsort()[ID]

Sample run -

In [173]: x
Out[173]: array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])

In [174]: unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)

In [175]: first_idx.argsort().argsort()[ID]
Out[175]: array([0, 1, 1, 2, 2, 1, 3, 1, 4, 5, 1, 1, 6, 7])
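
The double argsort turns first_idx into ranks: the unique value that appears earliest in x gets rank 0, the next gets rank 1, and so on. Indexing those ranks with ID then relabels every element. A sketch of an equivalent that needs only one argsort, by scattering np.arange through the sort permutation; first_occurrence_ids is a hypothetical helper name:

```python
import numpy as np


def first_occurrence_ids(x):
    """Group ids numbered by the order in which each value first appears in x."""
    unq, first_idx, inverse = np.unique(x, return_index=True, return_inverse=True)
    # rank[k] is the position of unique value k once the unique values
    # are ordered by their first appearance in x.
    rank = np.empty(len(unq), dtype=np.intp)
    rank[np.argsort(first_idx)] = np.arange(len(unq))
    return rank[inverse]


x = np.array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])
print(first_occurrence_ids(x))
# [0 1 1 2 2 1 3 1 4 5 1 1 6 7]
```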

【Discussion】:

Thanks for the reply. I don't need to preserve first-appearance numbering, but I appreciate this answer.