Fast method to find indexes of duplicates in a list of >2,000,000 items: answers

【Question title】: Fast method to find indexes of duplicates in a list of >2,000,000 items
【Posted】: 2019-05-28 15:34:00
【Question】:

I have a list in which each item is a combination of two event IDs (this is just a snippet of a much larger list of pairs):

['10000381 10007121', '10000381 10008989', '10005169 10008989', '10008989 10023817', '10005169 10043265', '10008989 10043265', '10023817 10043265', '10047097 10047137', '10047097 10047265', '10047137 10047265', '10000381 10056453', '10047265 10056453', '10000381 10060557', '10007121 10060557', '10056453 10060557', '10000381 10066013', '10007121 10066013', '10008989 10066013', '10026233 10066013', '10056453 10066013', '10056453 10070153', '10060557 10070153', '10066013 10070153', '10000381 10083798', '10047265 10083798', '10056453 10083798', '10066013 10083798', '10000381 10099969', '10056453 10099969', '10066013 10099969', '10070153 10099969', '10083798 10099969', '10056453 10167029', '10066013 10167029', '10083798 10167029', '10099969 10167029', '10182073 10182085', '10182073 10182177', '10182085 10182177', '10000381 10187233', '10056453 10187233', '10060557 10187233', '10066013 10187233', '10083798 10187233', '10099969 10187233', '10167029 10187233', '10007121 10200685', '10099969 10200685', '10066013 10218005', '10223905 10224013']

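I need to find every instance of each id pair and record its indexes in a new list. Right now I have a few lines of code that do this for me. However, my list is over 2,000,000 lines long, and it will keep growing as I process more data.

At the moment, the estimated time to completion is about 2 days.

I really just need a faster way to do this.

I am using Jupyter Notebooks (on a Mac laptop).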

def compiler(idlist):
    groups = []
    for i in idlist:
        groups.append([index for index, x in enumerate(idlist) if x == i])
    return(groups)

I also tried:

def compiler(idlist):
    groups = []
    for k,i in enumerate(idlist):
        position = []
        for c,j in enumerate(idlist):
            if i == j:
                position.append(c)
        groups.append(position)
    return(groups)

What I want is something like this:

'10000381 10007121': [0]
'10000381 10008989': [1]
'10005169 10008989': [2, 384775, 864173, 1297105, 1321798, 1555094, 1611064, 2078015]
'10008989 10023817': [3, 1321800]
'10005169 10043265': [4, 29113, 864195, 1297106, 1611081]
[5, 864196, 2078017]
'10008989 10043265': [6, 29114, 384777, 864198, 1611085, 1840733, 2078019]
'10023817 10043265': [7,86626,384780,864434,792690,864215,1297108,1524527,155096,1524527,155096,1595763,155098,1840763,1611098,1840734,181098,1840734,181280,1929457,1943701,198362,1943701,198362,209380,2139917,2168437] and so on.

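Each number in the brackets is an index of that pair in idlist.

Essentially, I want it to look at a pair of id values (e.g. '10000381 10007121'), go through the list, find every instance of that pair, and record every index at which the pair occurs. I need something that does this for every item in the list, in far less time.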

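【Comments】:

You could split first, then convert to a NumPy array and use unique, as shown here
You needed a different data structure a couple of million rows ago. If you are allowed to change it, look at dictionaries.
Right now you iterate over the whole list for every item, i.e. you have O(n^2) performance. You can use groups = collections.defaultdict(list) and then do for index, item in enumerate(idlist): groups[item].append(index), which is O(n).
Possible duplicate of How to efficiently find the indices of matching elements in two lists
Could you update your question to show matching sample data and "what I want" output? Also, are you looking for duplicate ids ('10000381' and '10007121'), or duplicate id pairs ('10000381 10007121')?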
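
The defaultdict suggestion in the comments above can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the function name group_indexes and the short sample list are made up for the example, and it assumes that "duplicates" means id pairs (whole strings) that occur more than once.

```python
from collections import defaultdict


def group_indexes(idlist):
    """Map each id pair to the list of positions where it occurs (single pass)."""
    groups = defaultdict(list)
    for index, item in enumerate(idlist):
        groups[item].append(index)
    return groups


# Small made-up sample, not the asker's real data.
sample = ['a b', 'c d', 'a b', 'e f', 'c d', 'a b']
groups = group_indexes(sample)

# Keep only the pairs that occur more than once.
duplicates = {pair: idxs for pair, idxs in groups.items() if len(idxs) > 1}
print(duplicates)  # {'a b': [0, 2, 5], 'c d': [1, 4]}
```

Each item is visited once and each dictionary operation is O(1) on average, which is where the O(n) total comes from.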

Tags: python list duplicates

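【Solution 1】:

You can use collections.OrderedDict to reduce the time complexity to O(n). Because it remembers insertion order, the values come out in the order in which the id pairs first appear: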

from collections import OrderedDict

groups = OrderedDict()
for i, v in enumerate(idlist):
    try:
        groups[v].append(i)
    except KeyError:
        groups[v] = [i]

list(groups.values()) then contains your final result.

【Comments】:

Instead of catching KeyError, you can use setdefault: groups.setdefault(v, []).append(i).
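
For illustration, here is a small sketch of the setdefault variant. It assumes Python 3.7 or later, where a plain dict also preserves insertion order, so OrderedDict is not strictly needed for this purpose; the function name group_indexes and the three-item sample are made up for the example.

```python
def group_indexes(idlist):
    """Group positions by id pair, keeping first-appearance order."""
    groups = {}  # plain dict: insertion-ordered on Python 3.7+
    for i, v in enumerate(idlist):
        # setdefault returns the existing list, or stores and returns a new one
        groups.setdefault(v, []).append(i)
    return groups


sample = ['10000381 10007121', '10000381 10008989', '10000381 10007121']
groups = group_indexes(sample)
print(list(groups.values()))  # [[0, 2], [1]]
```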

【Solution 2】:

Use a dictionary instead of a list, which makes each lookup O(1):

def compiler(idlist):
    groups = {}
    for idx, val in enumerate(idlist):
        if val in groups:
            groups[val].append(idx)
        else:
            groups[val] = [idx]
    return groups  # maps each id pair to the list of its indexes

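【Comments】:

Change groups to defaultdict(list) and see how that simplifies your code (the for loop body collapses to a single statement). But you still have to build the list of lists the OP asked for.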
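
One way to read this comment is sketched below: the loop body becomes a single statement, and a second pass rebuilds the per-item list of lists that the question's original compiler function returns. The names compiler_fast and compiler_slow and the sample data are hypothetical, and the sketch assumes the desired output is one index list per input item, as in the question's code.

```python
from collections import defaultdict


def compiler_fast(idlist):
    groups = defaultdict(list)
    for idx, val in enumerate(idlist):
        groups[val].append(idx)  # loop body is a single statement
    # One entry per input item; repeated items share the same list object.
    return [groups[val] for val in idlist]


def compiler_slow(idlist):
    # Quadratic approach from the question, kept only for comparison.
    return [[index for index, x in enumerate(idlist) if x == i] for i in idlist]


sample = ['a b', 'c d', 'a b']
result = compiler_fast(sample)
print(result)  # [[0, 2], [1], [0, 2]]
print(result == compiler_slow(sample))  # True
```

Because repeated items point at the same list object, mutating one entry of the result also changes its twins; copy the lists if that matters.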

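【Solution 3】:

If you have a lot of data, I suggest using Pypy3 instead of the CPython interpreter; you will get code that runs 5x to 7x faster.

Here is an implementation with a time-based benchmark, run with CPython and Pypy3 for 10000 iterations:

Code: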

from time import time
from collections import OrderedDict, defaultdict


def timeit(func, iteration=10000):
    # Decorator: call func `iteration` times and print the total wall-clock time.
    def wraps(*args, **kwargs):
        start = time()
        for _ in range(iteration):
            result = func(*args, **kwargs)
        end = time()
        print("func: {name} [{iteration} iterations] took: {elapsed:2.4f} sec".format(
            name=func.__name__,
            iteration=iteration,
            elapsed=(end - start)
        ))
        return result  # result of the last call
    return wraps


@timeit
def op_implementation(data):
    groups = []
    for k in data:
        groups.append([index for index, x in enumerate(data) if x == k])
    return groups


@timeit
def ordreddict_implementation(data):
    groups = OrderedDict()
    for k, v in enumerate(data):
        groups.setdefault(v, []).append(k)
    return groups


@timeit
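def defaultdict_implementation(data):
    # Note: this splits every pair and indexes the individual ids in the
    # flattened sequence, so it groups single ids rather than id pairs.
    groups = defaultdict(list)
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        groups[v].append(k)
    return groups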


@timeit
def defaultdict_implementation_2(data):
    groups = defaultdict(list)
    for k, v in enumerate(map(lambda x: tuple(x.split()), data)):
        groups[v].append(k)
    return groups


@timeit
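def dict_implementation(data):
    # Note: like defaultdict_implementation, this groups the individual ids
    # of the flattened sequence, not the id pairs.
    groups = {}
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        if v in groups:
            groups[v].append(k)
        else:
            groups[v] = [k]
    return groups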



if __name__ == '__main__':
    data = [
        '10000381 10007121', '10000381 10008989', '10005169 10008989', '10008989 10023817', 
        '10005169 10043265', '10008989 10043265', '10023817 10043265', '10047097 10047137', 
        '10047097 10047265', '10047137 10047265', '10000381 10056453', '10047265 10056453', 
        '10000381 10060557', '10007121 10060557', '10056453 10060557', '10000381 10066013', 
        '10007121 10066013', '10008989 10066013', '10026233 10066013', '10056453 10066013', 
        '10056453 10070153', '10060557 10070153', '10066013 10070153', '10000381 10083798', 
        '10047265 10083798', '10056453 10083798', '10066013 10083798', '10000381 10099969', 
        '10056453 10099969', '10066013 10099969', '10070153 10099969', '10083798 10099969', 
        '10056453 10167029', '10066013 10167029', '10083798 10167029', '10099969 10167029', 
        '10182073 10182085', '10182073 10182177', '10182085 10182177', '10000381 10187233', 
        '10056453 10187233', '10060557 10187233', '10066013 10187233', '10083798 10187233', 
        '10099969 10187233', '10167029 10187233', '10007121 10200685', '10099969 10200685', 
        '10066013 10218005', '10223905 10224013'
    ]
    op_implementation(data)
    ordreddict_implementation(data)
    defaultdict_implementation(data)
    defaultdict_implementation_2(data)
    dict_implementation(data)

CPython:

func: op_implementation [10000 iterations] took: 1.3096 sec
func: ordreddict_implementation [10000 iterations] took: 0.1866 sec
func: defaultdict_implementation [10000 iterations] took: 0.3311 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.3817 sec
func: dict_implementation [10000 iterations] took: 0.3231 sec

Pypy3:

func: op_implementation [10000 iterations] took: 0.2370 sec
func: ordreddict_implementation [10000 iterations] took: 0.0243 sec
func: defaultdict_implementation [10000 iterations] took: 0.1216 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.1299 sec
func: dict_implementation [10000 iterations] took: 0.1175 sec

Pypy3 with 200000 iterations:

func: op_implementation [200000 iterations] took: 4.6364 sec
func: ordreddict_implementation [200000 iterations] took: 0.3201 sec
func: defaultdict_implementation [200000 iterations] took: 2.2032 sec
func: defaultdict_implementation_2 [200000 iterations] took: 2.4052 sec
func: dict_implementation [200000 iterations] took: 2.2429 sec
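
The figures above come from repeating the call on the 50-item sample, so they do not show directly how the approaches behave on a list of millions of pairs. As a rough, hedged sketch, the standard-library timeit module can time a single pass over a larger synthetic list instead. The helpers make_pairs and group_pairs, the list size, and the id range are all assumptions made for this example; the data is random and is not the asker's.

```python
import random
import timeit
from collections import defaultdict


def make_pairs(n, id_count=1000, seed=0):
    """Build n synthetic 'id id' strings (hypothetical data, fixed seed)."""
    rng = random.Random(seed)
    return ['{} {}'.format(rng.randrange(id_count), rng.randrange(id_count))
            for _ in range(n)]


def group_pairs(data):
    """Single-pass grouping of positions by whole pair string."""
    groups = defaultdict(list)
    for index, pair in enumerate(data):
        groups[pair].append(index)
    return groups


data = make_pairs(200_000)
# Best of 3 runs, each grouping the whole list once.
best = min(timeit.repeat(lambda: group_pairs(data), number=1, repeat=3))
print('grouped {} pairs in {:.4f} sec'.format(len(data), best))
```

Timings depend on the machine and interpreter, so no expected output is given here.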

【Comments】: