Best method for changing a list while iterating over it [duplicate]: answers

【Question title】: Best method for changing a list while iterating over it [duplicate]
【Posted】: 2012-05-05 13:16:34
【Question】:

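I have several places in a Python script (v2.6) where I need to modify a list in place. I need to pop values from the list in response to interactive input from the user, and I would like to know the cleanest way to do it. Currently I have two very dirty solutions: a) setting the items I want to remove to False and then removing them with filter or a list comprehension, or b) building an entirely new list while going through the loop, which seems to add variables to the namespace and take up memory needlessly.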

An example of the problem is as follows:

for i, folder in enumerate(to_run_folders):
    if get_size(folder) < byte_threshold:
        ans = raw_input(('The folder {0}/ is less than {1}MB.' + \
                    ' Would you like to exclude it from' + \
                    ' compression? ').format(folder, megabyte_threshold))
        if 'y' in ans.strip().lower():
            to_run_folders.pop(i)

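I want to look at every folder in the list. If the current folder is smaller than a certain size, I want to ask the user whether to exclude it. If so, the folder is popped from the list.

The problem with this routine is that if I iterate over the list itself, I get unexpected behavior and early termination. If I iterate over a copy made by slicing, pop does not remove the right value, because the indices have shifted, and the error compounds as more items are popped. I need this kind of dynamic list adjustment in other parts of the script too. Is there a clean way to do this?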
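
To make the failure mode concrete, here is a minimal sketch written for Python 3 (the question itself targets Python 2.6). The function name and the sample values are invented for illustration, and the parity test stands in for the user's answer.

```python
def remove_odds_naively(values):
    """Deliberately buggy: pops from the list while enumerate() walks it."""
    for i, v in enumerate(values):
        if v % 2:              # stand-in for "the user said yes"
            values.pop(i)      # shifts every later item one slot to the left
    return values

# After each pop the iterator still advances, so the item that slid into
# slot i is never examined.
print(remove_odds_naively([1, 3, 5, 7, 2]))  # prints [3, 7, 2], not [2]
```

Every removal causes the following item to be skipped, which is the "unexpected behavior and early termination" described above.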

【Comments】:

Stick with your dirty solution. It is clearly the fastest and causes fewer problems.

Tags: python loops dynamic

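【Solution 1】:

You can loop over the list backwards, or use a view object.

See https://stackoverflow.com/a/181062/711085 for how to loop over a list backwards. Basically, use reversed(yourList); this creates a reverse iterator, effectively a view that visits the list backwards without copying it.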

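If you need the index, you might reach for reversed(enumerate(yourList)), but reversed() requires a sequence and enumerate returns an iterator, so you would have to write reversed(list(enumerate(yourList))). That builds a temporary list in memory, because enumerate has to run to completion before reversed can start. Instead, either do the index arithmetic yourself or do this: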

for i in xrange(len(yourList)-1, -1, -1):
    item = yourList[i]
    ...

Cleaner: reversed is aware of range, so you can do this in Python 3, or in Python 2 if you use xrange instead:

for i in reversed(range(len(yourList))):  
    item = yourList[i]
    ...

(Proof: you can run next(reversed(range(10**10))), but with Python 2 this will crash your computer.)
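
As a complete illustration of the backward index loop, here is a small Python 3 sketch. The helper name remove_backwards and the should_remove callback are assumptions made for this example; in the question, the callback would wrap the size check and the user prompt.

```python
def remove_backwards(items, should_remove):
    """Delete matching items in place, walking indices from last to first."""
    for i in reversed(range(len(items))):
        if should_remove(items[i]):
            del items[i]       # only shifts items at higher, already visited indices
    return items

data = list(range(10))
remove_backwards(data, lambda v: v % 2 != 0)
print(data)  # prints [0, 2, 4, 6, 8]
```

Because deletions only disturb positions that have already been processed, every remaining index still refers to the item it did at the start.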

【Discussion】:

Please have a look at the measurements.
@pepr: I have already commented on your answer.

【Solution 2】:

You can loop backwards

Backwards:

x = range(10)
l = len(x)-1  # max index

for i, v in enumerate(reversed(x)):
    if v % 2:
        x.pop(l-i)  # l-i is the forward index
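
The snippet above relies on Python 2, where range() returns a list. A possible Python 3 rendering of the same idea is sketched below; the variable name last is a renaming chosen for this example.

```python
x = list(range(10))        # in Python 3, range() must be wrapped to get a list
last = len(x) - 1          # highest valid index before any removal

for i, v in enumerate(reversed(x)):
    if v % 2:
        x.pop(last - i)    # last - i is the forward index of v

print(x)  # prints [0, 2, 4, 6, 8]
```

Here enumerate counts from the end of the list, so subtracting the counter from the original last index recovers the forward position of the current item.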

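【Discussion】:

It should be noted that although my answer says reversed(enumerate(yourList)) would make a copy of the list, this solution using enumerate(reversed(x)) does work efficiently, without copying the list. You could make i = len(x)-1-i the first line of the for loop to fix up the index for extra readability.
@ninjagecko: I was actually just about to comment on your answer about that. I was not sure they would copy, since they both return generators over the given iterable argument. Am I wrong? Either way, wouldn't one just yield from the other?
You are welcome to try typing next(reversed(enumerate(10**8))) at the prompt and hope you do not use up all of your computer's memory. =) As I said in my answer, reversed has to wait until enumerate has seen (and therefore cached) the entire list before it can return the last tuple.
@ninjagecko: reversed(enumerate(yourList)) is not even possible. It raises a TypeError complaining that reversed() must be given a sequence. So, apparently, you cannot even start that way :-)
Ah, indeed. However, it is very possible to do this in python3; I guess the python2 equivalent would use list to convert the result of enumerate(...) into a sequence.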

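【Solution 3】:

Currently I have two very dirty solutions: a) setting the items I want to remove to False and then removing them with filter or a list comprehension, or b) building an entirely new list while going through the loop, which seems to add variables to the namespace and take up memory needlessly.

Actually, that is not such a dirty solution. How long is the list typically? Even creating the new list should not consume much memory, because the list contains only references.

You can also loop with a while loop and do the enumeration yourself, executing del lst[n] when the user decides so (possibly counting the position in the original list separately).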
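
The while-loop idea can be sketched as follows in Python 3. The function name, the should_remove callback, and the returned list of original positions are assumptions added for illustration; the answer itself gives no code.

```python
def remove_forward(lst, should_remove):
    """Forward in-place removal; returns the original positions that were removed."""
    removed_positions = []
    i = 0                      # index into the shrinking list
    n = 0                      # position in the original list
    while i < len(lst):
        if should_remove(lst[i]):
            del lst[i]         # the next item slides into slot i, so i stays put
            removed_positions.append(n)
        else:
            i += 1
        n += 1
    return removed_positions

folders = ['a', 'bb', 'c', 'dd', 'e']
gone = remove_forward(folders, lambda name: len(name) == 1)
print(folders)  # prints ['bb', 'dd']
print(gone)     # prints [0, 2, 4]
```

Keeping two counters separates "where am I in the list now" from "which original item is this", which is what the parenthetical remark above hints at.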

【Discussion】:

【Solution 4】:

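OK, I have measured the solutions. The reversed solutions are about the same. The forward while loop is about 4 times slower. BUT! Patrik's dirty solution is about 80 times faster for a list of 100,000 random integers [bug in Patrik2 corrected]: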

import timeit
import random

def solution_ninjagecko1(lst):
    for i in xrange(len(lst)-1, -1, -1):
        if lst[i] % 2 != 0:    # simulation of the choice
            del lst[i]
    return lst

def solution_jdi(lst):
    L = len(lst) - 1
    for i, v in enumerate(reversed(lst)):
        if v % 2 != 0:
            lst.pop(L-i)  # L-i is the forward index
    return lst

def solution_Patrik(lst):
    for i, v in enumerate(lst):
        if v % 2 != 0:         # simulation of the choice
            lst[i] = None
    return [v for v in lst if v is not None]

def solution_Patrik2(lst):
    ##buggy lst = [v for v in lst if v % 2 != 0]
    ##buggy return [v for v in lst if v is not None]
    # ... corrected to
    return [v for v in lst if v % 2 == 0]  # keep the even values, like the other solutions

def solution_pepr(lst):
    i = 0                      # indexing the processed item
    n = 0                      # enumerating the original position
    while i < len(lst):
        if lst[i] % 2 != 0:    # simulation of the choice
            del lst[i]         # i unchanged if item deleted
        else:
            i += 1             # i moved to the next
        n += 1
    return lst

def solution_pepr_reversed(lst):
    i = len(lst) - 1           # indexing the processed item backwards
    while i >= 0:              # include index 0, otherwise the first item is never tested
        if lst[i] % 2 != 0:    # simulation of the choice
            del lst[i]         # shifts only items that were already processed
        i -= 1                 # i moved to the previous
    return lst

def solution_steveha(lst):
    def should_keep(x):
        return x % 2 == 0
    return filter(should_keep, lst)

orig_lst = range(30)
print 'range() generated list of the length', len(orig_lst)
print orig_lst[:20] + ['...']   # to have some fun :)

lst = orig_lst[:]  # copy of the list
print solution_ninjagecko1(lst)

lst = orig_lst[:]  # copy of the list
print solution_jdi(lst)

lst = orig_lst[:]  # copy of the list
print solution_Patrik(lst)

lst = orig_lst[:]  # copy of the list
print solution_pepr(lst)

orig_lst = [random.randint(1, 1000000) for n in xrange(100000)]
print '\nrandom list of the length', len(orig_lst)
print orig_lst[:20] + ['...']   # to have some fun :)

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_ninjagecko1(lst)',
                  'from __main__ import solution_ninjagecko1, lst',
                  number=1)
print 'solution_ninjagecko1: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_jdi(lst)',
                  'from __main__ import solution_jdi, lst',
                  number=1)
print 'solution_jdi: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_Patrik(lst)',
                  'from __main__ import solution_Patrik, lst',
                  number=1)
print 'solution_Patrik: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_Patrik2(lst)',
                  'from __main__ import solution_Patrik2, lst',
                  number=1)
print 'solution_Patrik2: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_pepr_reversed(lst)',
                  'from __main__ import solution_pepr_reversed, lst',
                  number=1)
print 'solution_pepr_reversed: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_pepr(lst)',
                  'from __main__ import solution_pepr, lst',
                  number=1)
print 'solution_pepr: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_steveha(lst)',
                  'from __main__ import solution_steveha, lst',
                  number=1)
print 'solution_steveha: ', t

It prints on my console:

c:\tmp\_Python\Patrick\so10305762>python a.py
range() generated list of the length 30
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, '...']
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

random list of the length 100000
[915411, 954538, 794388, 847204, 846603, 454132, 866165, 640004, 930488, 609138,
 333405, 986073, 318301, 728151, 996047, 117633, 455353, 581737, 55350, 485030,
'...']
solution_ninjagecko1:  2.41921752625
solution_jdi:  2.45477176569
solution_Patrik:  0.0468565138865
solution_Patrik2:  0.024270403082
solution_pepr_reversed:  2.43338888043
solution_pepr:  9.11879694207

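So, I tried a longer list. Merely doubling the length makes a big difference (on my old computer), since each del or pop has to shift the rest of the list. Patrik's dirty solution behaves very nicely. It is about 200 times faster than the reversed solutions: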

random list of the length 200000
[384592, 170167, 598270, 832363, 123557, 81804, 319315, 445945, 178732, 726600,
516835, 392267, 552608, 40807, 349215, 208111, 880032, 520614, 384119, 350090, 
'...']
solution_ninjagecko1:  17.362140719
solution_jdi:  17.86837545
solution_Patrik:  0.0957998851809
solution_Patrik2:  0.0500024444448
solution_pepr_reversed:  17.5078452708
solution_pepr:  52.175648581

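[Added after ninjagecko's comments]

The corrected Patrik2 solution is about twice as fast as the two-stage Patrick solution.

To simulate less frequent deletion of elements, tests like if v % 2 != 0: were changed to if v % 100 == 0:. About 1% of the items should then be deleted. Obviously, this takes less time. For 500,000 random integers in the list, the results are as follows: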

random list of the length 500000
[403512, 138135, 552313, 427971, 42358, 500926, 686944, 304889, 916659, 112636,
791585, 461948, 82622, 522768, 485408, 774048, 447505, 830220, 791421, 580706, 
'...']
solution_ninjagecko1:  6.79284210703
solution_jdi:  6.84066913532
solution_Patrik:  0.241242951269
solution_Patrik2:  0.162481823807
solution_pepr_reversed:  6.92106007886
solution_pepr:  7.12900522273

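Patrick's solution is still about 30 times faster.

[Added on April 25, 2012]

Here is another solution that works in place, loops forward, and is as fast as Patrick's solution. It does not move the whole tail when an element is deleted. Instead, it moves the wanted elements to their final positions and then cuts off the unused tail of the list.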

def solution_pepr2(lst):
    i = 0
    for v in lst:
        lst[i] = v              # moving the element (sometimes unnecessary)
        if v % 100 != 0:        # simulation of the choice
            i += 1              # here will be the next one
    lst[i:] = []                # cutting the tail of the length of the skipped
    return lst

# The following one only adds the enumerate to simulate the situation when
# it is needed -- i.e. slightly slower but with the same complexity.        
def solution_pepr2enum(lst):
    i = 0
    for n, v in enumerate(lst):
        lst[i] = v              # moving the element (sometimes unnecessary)
        if v % 100 != 0:        # simulation of the choice
            i += 1              # here will be the next one
    lst[i:] = []                # cutting the tail of the length of the skipped
    return lst

Compared with the solutions above, using v % 100 != 0:

random list of the length 500000
[533094, 600755, 58260, 295962, 347612, 851487, 523927, 665648, 537403, 238660,
781030, 940052, 878919, 565870, 717745, 408465, 410781, 560173, 51010, 730322, 
'...']
solution_ninjagecko1:  1.38956896051
solution_jdi:  1.42314502685
solution_Patrik:  0.135545530079
solution_Patrik2:  0.0926935780151
solution_pepr_reversed:  1.43573239178
solution_steveha:  0.122824246805
solution_pepr2:  0.0938177241656
solution_pepr2enum:  0.11096263294

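【Discussion】:

Interesting. Unfortunately, you did not check the reversed(range(len(yourList))) solution (though if you had, it would be about the same as the first solution). However, I do not think the benchmark is reasonable. In these benchmarks you are deleting half of the elements. In that case I would just do [x for i,x in enumerate(lst) if i%2!=0] and ignore the in-place requirement; this achieves a result twice as fast as the fastest solution in your benchmark. Also, the solution you present is not the "dirty" solution, because it is not in place and uses [...].
Indeed, according to your tests, Patrick's method is about 4 times as fast if only 20% of the list's elements are deleted, and about 2 times as fast if only 10% are deleted, though unfortunately I do not have time to double-check them.
@ninjagecko: I was surprised too. I tried deleting about 1% of random elements (see the edited text). For 500,000 elements, Patrick's method is still about 30 times faster. The question is: why should we insist on an in-place solution?
The bug in Patrick2 has been corrected: it is about twice as fast.
This is a silly test. It is weighted by how many deletions actually occur. When I replace the inner loop logic with pass, replace the case test with if False, and so on, to prevent any deletion from happening, @ninjagecko's comes out in front, then mine, then the others.

【Solution 5】:

The best way to handle this, the most "Pythonic" way, is in fact to loop over your list and build a new list that contains only the folders you want. Here is how I would do it: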

def want_folder(fname):
    # Folders at or above the size threshold are kept without asking.
    if get_size(fname) >= byte_threshold:
        return True
    ans = raw_input(('The folder {0}/ is less than {1}MB.' + \
                ' Would you like to exclude it from' + \
                ' compression? ').format(fname, megabyte_threshold))
    # Answering "y" means "exclude", so keep the folder only otherwise.
    return 'y' not in ans.strip().lower()

to_run_folders = [fname for fname in to_run_folders if want_folder(fname)]

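If your list is truly large, then you might need to worry about the performance of this solution and use dirty tricks. But if your list is that large, it might be a bit insane to have a human answer yes/no questions about every file that might show up.

Is performance an actual problem, or just a nagging worry? I am pretty sure the code above is fast enough for practical use, and it is easier to understand and modify than tricky code.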
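
If other parts of a script hold a reference to the original list, rebinding the name to a new list will not update them. One possible variation, sketched here in Python 3, assigns the comprehension to a full slice so the existing list object is modified. This is not part of the answer itself; the helper name keep_in_place and the sizes dictionary (a stand-in for get_size(), with no user prompt) are hypothetical.

```python
def keep_in_place(items, want):
    """Filter a list while keeping the same list object."""
    items[:] = [item for item in items if want(item)]
    return items

# Hypothetical sizes standing in for get_size(); no user prompt here.
sizes = {'docs': 10, 'logs': 5000, 'tmp': 3, 'video': 90000}
to_run_folders = ['docs', 'logs', 'tmp', 'video']
alias = to_run_folders         # another reference to the same list

keep_in_place(to_run_folders, lambda name: sizes[name] >= 1000)
print(alias)  # prints ['logs', 'video']
```

A temporary list is still built by the comprehension, so this keeps the readability of the approach above while meeting the in-place requirement only in the sense of object identity.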

EDIT: @jdi suggested, in the comments, using itertools.ifilter() or filter()

I tested, and this should actually be faster than what I showed above:

to_run_folders = filter(want_folder, to_run_folders)

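I just copied @pepr's benchmark code and tested the solution using filter() as shown here. It was second fastest overall, with only Patrik2 being faster. Patrik2 was twice as fast, but again, any data set small enough for a human to answer yes/no questions about is probably small enough that a factor of two will not matter much.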
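
One caveat for readers on Python 3: there filter() returns a lazy iterator rather than a list, so the one-liner needs a list() call if a list is required. A minimal sketch, with want_number as a made-up stand-in for want_folder:

```python
def want_number(x):
    """Stand-in predicate; the real code would ask about a folder."""
    return x % 2 == 0

values = list(range(10))

lazy = filter(want_number, values)       # Python 3: a lazy iterator, not a list
as_list = list(lazy)                     # materialize it when a list is needed
by_comprehension = [v for v in values if want_number(v)]

print(as_list)                           # prints [0, 2, 4, 6, 8]
print(as_list == by_comprehension)       # prints True
```

Both forms call the predicate once per item and produce the same list; which is faster on a given interpreter is best checked with timeit rather than assumed.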

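EDIT: Just for fun, I went ahead and wrote a version that is a pure list comprehension. It has only one expression to evaluate and no Python function call.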

to_run_folders = [fname for fname in to_run_folders
        if get_size(fname) >= mb_threshold or
                'y' not in raw_input(('The folder {0}/ is less than {1}MB.' +
                ' Would you like to exclude it from compression? '
                ).format(fname, mb_threshold)).strip().lower()
]

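Yuck! I much prefer writing a function.

【Discussion】:

This is basically itertools.ifilter(want_folder, to_run_folders), which is faster and more efficient.
itertools.ifilter() would indeed be a good replacement for the generator expression I suggested. But what if you want the list? Is list(itertools.ifilter(want_folder, to_run_folders)) faster than the list comprehension? I just checked with timeit, and in a simple test that filtered a long list of int values it was about 25% faster.
If you want the result directly as a list, use filter. It just depends on how the resulting value will be used later.
Are these faster than the list comprehension because the listcomp keeps rebinding a variable name? Both listcomp and filter() are built into Python and therefore written in C, so that is the main factor that could explain the speed difference (at least the only one I can think of).
I think that if they both call a Python function, they should generally be fairly close. But if filter is calling a built-in function, it will be faster. In this specific case, both have the overhead of calling a Python function.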