【Question title】: How to remove duplicates from a list of tuples but keeping the original order
【Posted】: 2014-09-03 17:31:50
【Question description】:

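I want to remove redundant tuples but keep the order in which they appear. I looked at similar questions. This one, Find unique rows in numpy.array, looked promising, but somehow it did not work for me.

I could use pandas as in this answer (https://stackoverflow.com/a/14089586/566035), but I would rather not, so that the executable generated by py2exe stays small.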

import numpy as np

data = [('a','z'), ('a','z'), ('a','z'), ('1','z'), ('e','z'), ('c','z')]

#What I want is:
    array([['a', 'z'],
           ['1', 'z'],
           ['e', 'z'],
           ['c', 'z']], 
          dtype='|S1')

#What I have tried:
# (1) numpy.unique, order not preserved
np.unique(data)

    array([['a', 'z'],
           ['c', 'z'],
           ['1', 'z'],
           ['e', 'z']], 
          dtype='|S1')

# (2) python set, order not preserved
set(data)

    set([('1', 'z'), ('a', 'z'), ('c', 'z'), ('e', 'z')])

# (3) answer here : https://stackoverflow.com/a/16973510/566035, order not preserved
a = np.array(data)
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

a[idx]

    array([['1', 'z'],
           ['a', 'z'],
           ['c', 'z'],
           ['e', 'z']], 
          dtype='|S1')
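
In attempt (3), the indices returned with return_index=True are the positions of each unique row's first occurrence, so sorting those indices before indexing should restore the original order. The following is only a sketch under two assumptions: NumPy 1.13 or later (where np.unique accepts axis), and the name ordered_unique, which is made up for illustration.

```python
import numpy as np

data = [('a', 'z'), ('a', 'z'), ('a', 'z'), ('1', 'z'), ('e', 'z'), ('c', 'z')]
a = np.array(data)

# For each unique row, return_index gives the index of its first occurrence.
# Assumption: NumPy >= 1.13, so that np.unique supports the axis argument.
_, idx = np.unique(a, axis=0, return_index=True)

# np.unique returns rows in sorted order; sorting the first-occurrence
# indices puts the unique rows back in order of first appearance.
ordered_unique = a[np.sort(idx)]
print(ordered_unique)
```

On Python 3 the resulting dtype is a unicode string type rather than the '|S1' shown above.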

【Question discussion】:

    Tags: python sorting numpy unique


    【Solution 1】:

    This is not very efficient, but the code is simple and readable, and it works fine for smaller lists:

    sorted(set(data), key=data.index)
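
The cost in the one-liner above comes from data.index, which scans the list from the start for every unique element. A possible variation, shown here only as a sketch, records each tuple's first position in a single pass and sorts by that instead. The name first_pos is hypothetical and not part of the original answer.

```python
data = [('a', 'z'), ('a', 'z'), ('a', 'z'), ('1', 'z'), ('e', 'z'), ('c', 'z')]

# Record the first position of every tuple in one pass over the list.
first_pos = {}
for i, item in enumerate(data):
    first_pos.setdefault(item, i)  # keeps the earliest index only

# Same idea as sorted(set(data), key=data.index), but each key lookup
# is a dictionary access instead of a linear scan.
result = sorted(set(data), key=first_pos.__getitem__)
print(result)
```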

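    【Discussion】:

    • Wow. This one is nice too, thanks!
    • It was hard to choose an answer, but I think I like this one best. Thanks, everyone!
    • Wow, the overhead of the sort and index operations is staggering... I wouldn't say it's "not very efficient"; I'd say it's really, really poor :(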
    【Solution 2】:

    Oops! I found the answer myself...

    seen = set()
    np.array([x for x in data if x not in seen and not seen.add(x)])
    
    # output
    array([['a', 'z'],
           ['1', 'z'],
           ['e', 'z'],
           ['c', 'z']], 
          dtype='|S1')
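
On newer interpreters there is a different technique that gives the same result without a comprehension side effect. It relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on. That version requirement is an assumption here; the original answer targets Python 2, where this does not hold. A minimal sketch:

```python
data = [('a', 'z'), ('a', 'z'), ('a', 'z'), ('1', 'z'), ('e', 'z'), ('c', 'z')]

# dict.fromkeys keeps one key per distinct tuple, in order of first insertion.
# Assumption: Python 3.7+, where dict ordering is part of the language.
unique_in_order = list(dict.fromkeys(data))
print(unique_in_order)
```

The list can be passed to np.array afterwards if an array is needed.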
    

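    【Discussion】:

    • I never thought of using and not to force a call that returns None, like this one. Sneaky!
    • Yes, it's tricky. I found it somewhere on Stack Overflow a long time ago, but I don't remember where.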
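
The trick discussed in these comments works because set.add always returns None, so not seen.add(x) is always True and is only evaluated when x not in seen has already passed. For comparison, here is an explicit-loop equivalent, written as a sketch; the function name unique_in_order is made up for illustration.

```python
def unique_in_order(items):
    """Return the items with duplicates removed, keeping first occurrences."""
    seen = set()
    result = []
    for x in items:
        if x not in seen:
            seen.add(x)       # returns None; called only for its side effect
            result.append(x)
    return result


data = [('a', 'z'), ('a', 'z'), ('a', 'z'), ('1', 'z'), ('e', 'z'), ('c', 'z')]

# The comprehension from the answer, for comparison.
seen = set()
via_comprehension = [x for x in data if x not in seen and not seen.add(x)]

print(unique_in_order(data) == via_comprehension)
```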
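    【Solution 3】:

    In my tests, a list comprehension that checks uniqueness against a set improved the run time by a factor of 4 (or, in terms of complexity, from O(n^2) to O(n)):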

    import functools
    import time
    import string
    import random
    # initializing list
    data = [
        (random.choice(string.ascii_lowercase), 
        random.choice(string.ascii_lowercase)) 
        for _ in range(10_000)
    ]
    
    # using reduce to isolate uniques while keeping order
    start_time = time.time()
    control_set = set()
    result1 = functools.reduce(lambda a, b: control_set.add(b) or a+[b] if b not in control_set else a, data, [])
    print("--- %s seconds ---" % (time.time() - start_time))
    "--- 0.002000570297241211 seconds ---"
    
    # creating a set and ordering them by original index
    start_time = time.time()
    control_set = set()
    result2 = sorted(set(data), key=data.index)
    print("--- %s seconds ---" % (time.time() - start_time))
    "--- 0.00800013542175293 seconds ---"
    
    # list comprehension with set-membership
    start_time = time.time()
    control_set = set()
    result3 = [
        data_element 
        for data_element in data 
        if data_element not in control_set
        and (control_set.add(data_element) or True)
    ]
    print("--- %s seconds ---" % (time.time() - start_time))
    "--- 0.0010018348693847656 seconds ---"
    
    def get_unique(rec_list):
        # Base case; <= 1 also covers an empty list, which would otherwise recurse forever
        if len(rec_list) <= 1:
            return rec_list
        l = get_unique(rec_list[:len(rec_list)//2])
        r = get_unique(rec_list[len(rec_list)//2:])
    
        set_l = set(l)
        set_r = set(r)
        set_r -= (set_l.intersection(set_r))
        return sorted(set_l, key=l.index) + sorted(set_r, key=r.index)
    
    start_time = time.time()
    result4 = get_unique(data)
    print("--- %s seconds ---" % (time.time() - start_time))
    "--- 0.11902070045471191 seconds ---"
    
    assert result1 == result2 == result3 == result4, "one of them failed"
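
Single time.time() measurements like the ones above can be noisy at millisecond scale. One way to get steadier numbers is the standard-library timeit module. The following is only a sketch: the function names, the fixed seed, and the repeat counts are arbitrary choices for illustration and do not come from the answer.

```python
import random
import string
import timeit

random.seed(0)  # fixed seed so that every run times the same input
data = [
    (random.choice(string.ascii_lowercase), random.choice(string.ascii_lowercase))
    for _ in range(10_000)
]


def by_sorted_index(items):
    # set, then sort by first index (a linear scan per unique element)
    return sorted(set(items), key=items.index)


def by_seen_set(items):
    # list comprehension with set membership
    seen = set()
    return [x for x in items if x not in seen and not seen.add(x)]


assert by_sorted_index(data) == by_seen_set(data)

calls = 5
for fn in (by_sorted_index, by_seen_set):
    # Take the best of three repeats to reduce noise from other processes.
    best = min(timeit.repeat(lambda: fn(data), number=calls, repeat=3))
    print(f"{fn.__name__}: {best / calls:.6f} s per call")
```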
    

    【Discussion】:
