Time complexity of a huge list in Python 2.7: answers

【Question title】: Time complexity of a huge list in python 2.7
【Posted】: 2014-03-27 13:08:45
【Question description】:

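I have a list containing about 177071007 items. I am trying to do the following: a) get the first and last occurrence of each unique item in the list; b) count the number of occurrences of each.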

def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter = '\t')
    pc = list()
    rd = list()
    deltas = list()
    reoccurance = list()
    try:
        for row in ins:
            pc.append(int(row[0]))
            rd.append(int(row[1]))
    except:
        print row
        pass

    unique_pc = set(pc)
    unique_pc = list(unique_pc)
    print "closing file"

    #takes a long time from here!
    for a in range(0, len(unique_pc)):
        index_first_occurance = pc.index(unique_pc[a])
        index_last_occurance = len(pc) - 1 - pc[::-1].index(unique_pc[a])
        delta_rd = rd[index_last_occurance] - rd[index_first_occurance]
        deltas.append(int(delta_rd))
        reoccurance.append(pc.count(unique_pc[a]))
        print unique_pc[a] , delta_rd, reoccurance[a]

    print "printing to file"
    map_file =  open(op_file_test,'a')
    for a in range(0, len(unique_pc)):
        print >>map_file, "%d, %d, %d" % (unique_pc[a], deltas[a], reoccurance[a])
    map_file.close()

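However, the complexity is on the order of O(n). Is it possible to make the for loop "run fast"? I mean, do you think yield would make it faster, or is there another way? Unfortunately, I don't have numpy.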

【Question comments】:

Tags: python linux

【Solution 1】:

Try the following:

import csv
from collections import defaultdict

# Keep a dictionary of our rd and pc values, with the value as a list of the line numbers each occurs on
# e.g. {10: [1, 45, 79]}
pc_elements = defaultdict(list)
rd_elements = defaultdict(list)

with open(file, 'rb') as f:  # file is the input path from the question
    csvin = csv.reader(f, delimiter='\t')
    # enumerate supplies the line number, including for rows that are skipped
    for line_number, row in enumerate(csvin):
        try:
            # parse both columns first so the two dicts stay in step
            pc_value = int(row[0])
            rd_value = int(row[1])
        except ValueError:
            print("Not a number")
            print(row)
            continue
        pc_elements[pc_value].append(line_number)
        rd_elements[rd_value].append(line_number)

for pc, indexes in pc_elements.iteritems():
    print("pc {0} appears {1} times. First on row {2}, last on row {3}".format(
        pc,
        len(indexes),
        indexes[0],
        indexes[-1]
    ))

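This works by building dictionaries while the TSV is read, with the pc value as the key and the list of rows where it occurs as the value. By the nature of a dict, keys must be unique, so we avoid the set entirely, and the list values are used only to record the rows on which each key occurs.

Example: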

pc_elements = {10: [4, 10, 18, 101], 8: [3, 12, 13]}

would output:

"pc 10 appears 4 times. First on row 4, last on row 101"
"pc 8 appears 3 times. First on row 3, last on row 13"
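
To try the idea without the full file, here is a minimal sketch on a small in-memory sample. The `rows` data and the `summarize` helper are made up for illustration and are not part of the answer above. The sketch also stores the rd value next to each line number, on the assumption that the rd delta from the question is wanted as well.

```python
from collections import defaultdict

def summarize(rows):
    """Return {pc: (count, first_row, last_row, rd_delta)} after one pass."""
    seen = defaultdict(list)  # pc -> [(line_number, rd), ...]
    for line_number, (pc, rd) in enumerate(rows):
        seen[pc].append((line_number, rd))
    summary = {}
    for pc, hits in seen.items():  # .items() works on Python 2.7 and 3
        first_row, first_rd = hits[0]
        last_row, last_rd = hits[-1]
        summary[pc] = (len(hits), first_row, last_row, last_rd - first_rd)
    return summary

# Hypothetical sample: (pc, rd) pairs in file order
rows = [(8, 100), (10, 105), (8, 120), (10, 150), (8, 180)]
result = summarize(rows)
for pc in sorted(result):
    print("pc {0}: count={1}, first row={2}, last row={3}, rd delta={4}".format(pc, *result[pc]))
```

Each row is touched once while the dictionary is built, and each unique pc is touched once for the summary, so the repeated `index` and `count` scans from the question are no longer needed.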

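【Discussion】:

How is continue different from pass? :)
@pistal - continue moves on to the next iteration of the loop (without running any of the code after it), whereas pass does nothing, so any code after it still runs before the next iteration. There is a good example here.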
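
The difference described in the comment above can be seen in a small sketch. The `run` helper and its `keyword` argument are hypothetical names used only for this illustration.

```python
def run(keyword):
    """Log the numbers a loop processes when odd numbers hit continue or pass."""
    log = []
    for n in range(4):
        if n % 2:
            if keyword == "continue":
                continue  # jump straight to the next iteration
            else:
                pass      # does nothing; execution falls through
        log.append(n)     # skipped only when continue was executed
    return log

with_continue = run("continue")
with_pass = run("pass")
print(with_continue)  # [0, 2]
print(with_pass)      # [0, 1, 2, 3]
```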

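【Solution 2】:

As you scan the items from the input file, put them into a collections.defaultdict(list), where the key is the item and the value is the list of indexes at which it occurs. Reading the file and building this data structure takes linear time; getting an item's first and last occurrence index takes constant time, and so does getting its number of occurrences.

This is how it works: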

import collections

mydict = collections.defaultdict(list)
# itemfilereader is a placeholder for something that yields (item, index) pairs
for item, index in itemfilereader: # O(n)
    mydict[item].append(index)

# first occurrence of item, O(1)
mydict[item][0]

# last occurrence of item, O(1)
mydict[item][-1]

# number of occurrences of item, O(1)
len(mydict[item])

【Discussion】:

itemfilereader is your csv reader plus something that gives you the index (line number) at which each item occurs. enumerate comes to mind.
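
One way to read that comment, as a sketch only: wrap the csv reader in a generator that uses `enumerate` for the row number. The sample lines below stand in for the real file and are invented for illustration; with a real file you would pass the open file object instead of the list.

```python
import collections
import csv

def itemfilereader(lines):
    """Yield (item, index) pairs, where index is the 0-based row number."""
    for index, row in enumerate(csv.reader(lines, delimiter='\t')):
        yield int(row[0]), index

# Hypothetical stand-in for the file: csv.reader accepts any iterable of lines
sample_lines = ["7\t1", "9\t2", "7\t5"]

mydict = collections.defaultdict(list)
for item, index in itemfilereader(sample_lines):
    mydict[item].append(index)

print(mydict[7][0], mydict[7][-1], len(mydict[7]))
```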
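Even with O(1) data structures, reading in and allocating enough memory for 177 million items will be slow. Unless many of your items occur more than once, you may want to skip any items you know you don't need while reading the file. If you only need the first and last occurrence of a single item, you can avoid reading the whole file. That is what the head and tail command-line programs in Linux do.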
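
If memory is the concern, one possible variation (not something proposed in the answer itself) is to keep only three numbers per key instead of every index. This assumes that only the first rd, the last rd and the count are needed, as in the question. The `first_last_count` helper and the sample pairs are hypothetical.

```python
def first_last_count(pairs):
    """One pass; keeps [first_rd, last_rd, count] per pc instead of every index."""
    stats = {}
    for pc, rd in pairs:
        entry = stats.get(pc)
        if entry is None:
            stats[pc] = [rd, rd, 1]  # first sighting: first and last coincide
        else:
            entry[1] = rd            # overwrite the last rd seen so far
            entry[2] += 1
    return stats

# Hypothetical (pc, rd) pairs in file order
pairs = [(3, 10), (4, 11), (3, 25), (3, 40)]
stats = first_last_count(pairs)
deltas = {pc: last - first for pc, (first, last, count) in stats.items()}
print(stats)
print(deltas)
```

Memory then grows with the number of unique keys rather than with the number of rows, at the cost of losing the individual positions.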

【Solution 3】:

It may be worth changing the data structure you use. I would use a dictionary with pc as the key and the occurrences as the value.

lookup = {}
# ins is the csv.reader from the question
for counter, line in enumerate(ins):
    values = lookup.setdefault(int(line[0]), [])
    # store a (position, rd) pair for every occurrence of this pc
    values.append((counter, int(line[1])))

for key, val in lookup.iteritems():
    # the first occurrence is at index 0, the last at index -1
    value_of_first_occurence = val[0][1]
    value_of_last_occurence = val[-1][1]
    first_occurence = val[0][0]
    last_occurence = val[-1][0]
    value = val[0]  # the first (position, rd) pair

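【Discussion】:

Sorry, I posted the wrong code. It is updated now. I read each line and store "pc" as the key in a dictionary, together with a list holding the value returned by the csv reader ("rd") and a nested list of occurrences.
Even more sorry. You are interested in the delta of rd, not the delta of the positions. I have updated it.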

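【Solution 4】:

Try replacing the lists with a dictionary; a lookup in a dictionary is much faster than a lookup in a long list.

Maybe something like this: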

import csv

def parse_data(file, op_file_test):
  # Dict of pc -> [rd first occurence, rd last occurence, row indexes of occurences...]
  occurences = {}

  with open(file, 'rb') as f:
    ins = csv.reader(f, delimiter = '\t')
    # a csv.reader has no len() and cannot be indexed, so enumerate it
    for i, row in enumerate(ins):
      try:
        pc = int(row[0])
        rd = int(row[1])
      except (ValueError, IndexError):
        print row
        continue

      if pc not in occurences:
        occurences[pc] = [rd, rd, i]
      else:
        occurences[pc][1] = rd
        occurences[pc].append(i)

  # (Remove the sorted if you don't need them sorted but need them faster)
  for value in sorted(occurences.keys()):
    print "value: %d, delta: %d, occurences: %s" % (
      value, occurences[value][1] - occurences[value][0],
      ", ".join(str(i) for i in occurences[value][2:]))
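
The claim at the start of this answer, that dictionary lookups are much faster than lookups in a long list, can be checked with a rough timing sketch. The data size and repeat count are arbitrary assumptions, and absolute timings depend on the machine. The point is only that `in` on a list scans the elements one by one, while `in` on a dict hashes the key.

```python
import timeit

values = list(range(20000))      # hypothetical data
as_list = values
as_dict = dict.fromkeys(values)  # only the keys matter here
target = values[-1]              # worst case for the list scan

list_time = timeit.timeit(lambda: target in as_list, number=200)
dict_time = timeit.timeit(lambda: target in as_dict, number=200)
print("list: %.6fs  dict: %.6fs" % (list_time, dict_time))
```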

【Discussion】: