Efficiently removing consecutive pair duplicates in Python? (Answers)

【Question title】: Efficiently removing consecutive pair duplicates in Python?
【Posted】: 2020-08-26 09:59:45
【Question description】:

I have a bunch of long lists (millions of elements each) that contain time and temperature values ([time, temperature]). The lists look like this:

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

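What I want to do is get rid of consecutive pair duplicates. If a pair of temperatures is repeated consecutively, drop the repeated elements (keeping only one copy of the pair).

That wording may be a bit confusing, so here is an example using mylist:

mylist[0] and mylist[1] are a consecutive pair. So are mylist[1] and mylist[2], and so on.

Moving on. Now look at the temperature values in mylist. From mylist[0] through mylist[11], the temperature values are: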

72 75 74 75 74 75 79 71 79 71 75 74

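In the temperature values above, you can see that the pairs 75 74 and 79 71 each repeat back to back. I want to keep just one copy of each pair and get rid of the repeat. So the output I want is: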

output = [[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

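Note: the elements [11, 75] and [12, 74] are kept because, although they also contain the 75 74 pattern, they do not repeat consecutively the way the earlier ones in the list do.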
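
One way to pin the rule down, offered as a minimal sketch rather than as part of the original question: a pair counts as a consecutive duplicate when the temperatures at positions i and i+1 equal those at positions i+2 and i+3. The helper name `has_adjacent_repeat` is made up for illustration.

```python
def has_adjacent_repeat(rows):
    """Return True if some temperature pair is immediately followed by the same pair."""
    temps = [temp for _, temp in rows]
    return any(
        temps[i] == temps[i + 2] and temps[i + 1] == temps[i + 3]
        for i in range(len(temps) - 3)
    )

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75],
          [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]
output = [[1, 72], [2, 75], [3, 74], [6, 75],
          [7, 79], [8, 71], [11, 75], [12, 74]]

print(has_adjacent_repeat(mylist))  # True: 75 74 is followed directly by 75 74
print(has_adjacent_repeat(output))  # False: no pair is followed directly by itself
```

A check like this can double as a quick sanity test for any candidate solution: the result should no longer contain an adjacent repeated pair.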

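I searched around and tried many things to solve this. The closest I got was a for-loop solution in which I check an element against the previous one (index-1), then check index-2 and index-3, and if they turn out to be a temperature repeat, I delete the two elements. I then repeat this looking ahead (index+1). It sort of works, but it gets very messy and very slow, and it turns my computer into a portable heater. So I'm wondering whether anyone knows how to get rid of these consecutive repeated pairs quickly and efficiently.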

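【Question discussion】:

Can the pattern length be greater than 2? That is, can [72, 75, 74] be a pattern?
@GilseungAhn Hi, thanks for your reply! The pattern should only have a length of 2. This is because the temperature often fluctuates between two points, and I want to get rid of these fluctuations to make the data files smaller. Does that help?

Tags: python-3.x list performance duplicates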

【Solution 1】:

Assuming I have understood the requirement correctly, the code below does the job.

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

n = len(mylist)
index = 0
output_list = []

# We need at least four elements to check if there is a duplicate pair.
while index + 4 <= n:
    sub_list = mylist[index: index + 4]

    if sub_list[0][1] == sub_list[2][1] and sub_list[1][1] == sub_list[3][1]:
        print('Duplicate found')
        # Drop the second one.
        output_list.append(sub_list[0])
        output_list.append(sub_list[1])
        index += 4
    else:
        # Keep only the first element: a duplicate pair may still start at the next element.
        output_list.append(sub_list[0])
        index += 1

# Append the remaining elements if any exist.
output_list.extend(mylist[index:])


print(output_list)
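
The loop above skips four elements after a match, so a pair that repeats three or more times in a row (for example 75 74 75 74 75 74) is only partly collapsed: the third copy is appended unchanged with the remaining elements. If such runs should also reduce to a single pair, which is an assumption about the requirement rather than something the question states, one hedged variant is to compare each new element against the tail of the output list. The function name `drop_repeated_pairs` is illustrative.

```python
def drop_repeated_pairs(rows):
    """Single pass: keep a row unless it completes a repeat of the previous kept pair."""
    out = []
    for row in rows:
        out.append(row)
        # If the last two kept temperatures repeat the two before them,
        # drop the newest pair again (removing from the end of a list is cheap).
        if (len(out) >= 4
                and out[-4][1] == out[-2][1]
                and out[-3][1] == out[-1][1]):
            del out[-2:]
    return out

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75],
          [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]
print(drop_repeated_pairs(mylist))
# [[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

triple = [[1, 75], [2, 74], [3, 75], [4, 74], [5, 75], [6, 74]]
print(drop_repeated_pairs(triple))
# [[1, 75], [2, 74]]
```

This sketch never deletes from the middle of a list, which is usually what makes in-place deletion loops slow on millions of elements.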

【Discussion】:

【Solution 2】:

Using collections.deque:

from collections import deque

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

def generate(lst):
    d = deque(maxlen=4)
    for v in lst:
        d.append(v)
        if len(d)==4:
            if d[0][1] == d[2][1] and d[1][1] == d[3][1]:
                d.pop()
                d.pop()
            else:
                yield d.popleft()

    yield from d # yield the rest of deque


out = [*generate(mylist)]
print(out)

Prints:

[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]
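
Because `generate` iterates over its argument only once and keeps at most four rows in the deque, it should also accept a lazy source such as a file reader or another generator, not just a list. A small sketch of that usage follows; `readings` is a hypothetical stand-in for such a source, and the temperatures in it are invented for illustration.

```python
from collections import deque

def generate(lst):
    # Same logic as the answer: a sliding window over the last four kept rows.
    d = deque(maxlen=4)
    for v in lst:
        d.append(v)
        if len(d) == 4:
            if d[0][1] == d[2][1] and d[1][1] == d[3][1]:
                # The newest pair repeats the previous one: discard it.
                d.pop()
                d.pop()
            else:
                # The oldest row can no longer be part of a repeat: emit it.
                yield d.popleft()
    yield from d  # emit whatever is left in the window

def readings():
    # Hypothetical lazy source: rows are produced one at a time.
    temps = [72, 75, 74, 75, 74, 75, 74, 80]
    for t, temp in enumerate(temps, start=1):
        yield [t, temp]

result = list(generate(readings()))
print(result)
# [[1, 72], [2, 75], [3, 74], [8, 80]]
```

In this input the 75 74 pair occurs three times in a row, and only the first copy survives.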

Benchmark (using 10_000_000 elements):

import random
from timeit import timeit

mylist = []
for i in range(10_000_000):
    mylist.append([i, random.randint(50, 100)])

def f1():
    return [*generate(mylist)]

t1 = timeit(lambda: f1(), number=1)
print(t1)

Prints on my machine (AMD 2400G, Python 3.8):

3.2782217629719526
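
A single `timeit` run can be noisy, and temperatures drawn uniformly from 50 to 100 contain few repeated pairs. As a sketch under those assumptions, the timing can be repeated and the input seeded so runs are comparable. The list size, the seed, and the narrow 50 to 52 range below are arbitrary choices for illustration, not values from the benchmark above, and no timing result is claimed here.

```python
import random
from collections import deque
from timeit import repeat

def generate(lst):
    # Same deque-based generator as in the answer.
    d = deque(maxlen=4)
    for v in lst:
        d.append(v)
        if len(d) == 4:
            if d[0][1] == d[2][1] and d[1][1] == d[3][1]:
                d.pop()
                d.pop()
            else:
                yield d.popleft()
    yield from d

random.seed(0)  # fixed seed so every run sees the same data
# A narrow temperature range makes repeated pairs far more likely.
rows = [[i, random.randint(50, 52)] for i in range(200_000)]

# Run the measurement three times and report the fastest run.
timings = repeat(lambda: list(generate(rows)), number=1, repeat=3)
best = min(timings)
print(f"best of 3: {best:.3f} s")
```

Taking the minimum of several runs is a common way to reduce the influence of other activity on the machine.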

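【Discussion】:

+1 for a great answer! Thanks again for your help, Andrej. Both your answer and Srikant's seem to work very well on my data, and it's great that neither uses any 3rd-party libraries. I wish I could mark both as "accepted". I marked Srikant's as "best" only because he has the lower score. I hope you don't mind. Again, I really appreciate your answer; it's a great solution.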

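【Solution 3】:

Using collections.Counter and numpy.

Try this code. It counts each length-2 temperature pattern across the list and, for any pattern seen at least twice, removes every occurrence after the first; it does not check that the occurrences are adjacent. In the example, the final pair [11, 75], [12, 74] survives only because both `range(len(temperature) - l)` calls stop one position short of the last pair.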

import numpy as np
from collections import Counter

def remove_consecutive_pair_duplicate(L):
    temperature = np.array(L, dtype = str)[:, 1]
    l = 2 # length of pattern       
    pattern_with_length_l = Counter('-'.join(temperature[i:i+l]) for i in range(len(temperature) - l))

    set_of_patterns = []
    for (key, val) in pattern_with_length_l.items():
        left, right = key.split('-')        
        if val >= 2 and right + '-' + left not in set_of_patterns:
            set_of_patterns.append(key)

    removed_index = []
    for pattern in set_of_patterns:
        matched_index = [[i, i+1] for i in range(len(temperature) - l) if '-'.join(temperature[i:i+2]) == pattern]
        for ind in matched_index[1:]:
            removed_index.append(ind[0])
            removed_index.append(ind[1])

    # Sort the surviving indices so the rows keep their original order (set order is not guaranteed).
    survived_ind = sorted(set(range(len(L))) - set(removed_index))
    return np.array(L)[survived_ind].tolist()

print(remove_consecutive_pair_duplicate(mylist))

The result is as follows.

[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

【Discussion】:

+1 Thanks for your answer! However, Andrej's solution seems to be the fastest. Great answer, nonetheless!