Searching for matches of a (numeric) string in a huge list in Python: answers

【Question title】: Searching for matches of (numeric) string in a huge list in Python
【Posted】: 2013-10-19 00:22:14
【Question description】:

I have a sorted list of numbers in character format (10 million entries). Every entry has a constant length of 15 characters. Think of it like this:

100000000000000
100000000000001
...
100000010000000

Now I want to create a regular breakdown of this list to see how the entries accumulate in different ranges. The output might look like this:

100000000xxxxxx, 523121 entries
100000001xxxxxx, 32231 entries

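So far I have tried reading the whole list into a set and then searching through it. I have tried both string and int formats. The integer version is about 3 times faster than the string version. The code looks like this: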

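import sys

inputfile = sys.argv[1]

# Read every line into a set, once as strings and once as integers.
collection_str = set(line.strip() for line in open(inputfile))
collection_int = set(int(line.strip()) for line in open(inputfile))

def find_str(look_for, ourset):
    # Linear scan: count the entries that start with the given prefix.
    count = 0
    for entry in ourset:
        if entry.startswith(look_for):
            count += 1
    return count

def find_int(look_for, ourset):
    # Turn the prefix into a half-open range [search_min, search_max).
    search_min = int(str(look_for) + "000000")
    search_max = int(str(look_for + 1) + "000000")

    # Linear scan: count the entries that fall inside the range.
    count = 0
    for entry in ourset:
        if search_min <= entry < search_max:
            count += 1
    return count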

The results look like this:

"int version"
100000100 27401 (0.515992sec)
100000101 0 (0.511334sec)
100000102 0 (0.510956sec)
100000103 0 (0.510467sec)
100000104 0 (0.512834sec)
100000105 0 (0.511501sec)

"string version"
100000100 27401 (1.794804sec)
100000101 0 (1.794449sec)
100000102 0 (1.802035sec)
100000103 0 (1.797590sec)
100000104 0 (1.793691sec)
100000105 0 (1.796785sec)

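Can I make this faster somehow? Even at 0.5 s per range, it still takes a long time if I want to run it often to produce periodic statistics. From searching around, I have seen that some people use bisect for something similar, but I can't quite wrap my head around how it is supposed to work.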

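【Question discussion】:

Could you upload a sample file? I would like to try this.
bisect is for binary search.
For ranges: sort first, then use binary search. For single elements: use a dict and the in operator.
I think I could produce something... but the data is my employer's property ;) Hold on.
@SaltyEgg, ah, sorry, the list is indeed already pre-sorted. Adding that to the question.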

Tags: python performance algorithm list search

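【Solution 1】:

If the list is sorted, bisect will use a bisection search to find the index that matches your criterion. In the timings below, bisect is much faster than using a numpy array, because it inspects only O(log n) elements while the numpy comparison scans the whole array.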

import numpy as np
import bisect
from random import randint
from timeit import Timer

ip = ['1{0:014d}'.format(randint(0, 10000000)) for x in xrange(10000000)]
ip = sorted(ip)
print bisect.bisect(ip, '100000000010000')
# 9869
t = Timer("bisect.bisect(ip, '100000000010000')", 'from __main__ import bisect, ip')
print t.timeit(100)
# 0.000268309933485 seconds (total for 100 calls)

ip_int = map(int, ip)
print bisect.bisect(ip_int, 100000000010000)
# 9869
t = Timer("bisect.bisect(ip_int, 100000000010000)", 'from __main__ import bisect, ip_int')
print t.timeit(100)
# 0.000137443078672 seconds (total for 100 calls)

ip_numpy = np.array(ip_int)
print np.sum(ip_numpy <= 100000000010000)
# 9869
t = Timer("np.sum(ip_numpy <= 100000000010000)", 'from __main__ import np, ip_numpy')
print t.timeit(100)
# 8.23690123071 seconds (total for 100 calls)

See also: Binary search algorithm
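
To connect this back to the per-range counts asked for in the question, here is a minimal sketch (written for Python 3) that counts the entries sharing a prefix with two bisect_left calls on the sorted list. The helper name count_prefix, the width parameter, and the sample data are made up for illustration. It assumes the entries are integers that all have exactly 15 digits.

```python
import bisect

def count_prefix(sorted_ints, prefix, width=15):
    """Count entries in a sorted list of ints that start with `prefix`.

    Hypothetical helper: assumes every entry has exactly `width` digits.
    """
    scale = 10 ** (width - len(str(prefix)))
    lo = prefix * scale          # smallest value carrying this prefix
    hi = (prefix + 1) * scale    # smallest value carrying the next prefix
    # bisect_left returns the first index whose value is >= the argument,
    # so the difference is the number of entries in [lo, hi).
    return bisect.bisect_left(sorted_ints, hi) - bisect.bisect_left(sorted_ints, lo)

# Tiny made-up sample, already sorted.
sample = [
    100000000000000,
    100000000000001,
    100000000999999,
    100000001000000,
    100000001500000,
    100000010000000,
]

for prefix in (100000000, 100000001, 100000010):
    print(prefix, count_prefix(sample, prefix))
```

Each count costs two binary searches instead of a pass over all 10 million entries, so a full breakdown over many prefixes should stay cheap as long as the list is kept sorted.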

【Discussion】:

【Solution 2】:

Put it into a numpy array. Then you can use vectorization, which is both nice and fast :)

from random import randint
import numpy
ip = numpy.array(['1{0:014d}'.format(randint(0, 10000000)) for x in xrange(10000000)], dtype=numpy.int64)

numpy.sum(ip <= 100000000010000)
# 9960
%timeit numpy.sum(ip <= 100000000010000)
# 10 loops, best of 3: 35 ms per loop

Put in terms of your search function:

import numpy

def find_numpy(look_for, ourset):
    search_min = int('{0:0<15s}'.format(str(look_for)))
    search_max = int('{0:0<15s}'.format(str(look_for+1)))
    return numpy.sum((ourset >= search_min) & (ourset < search_max))

with open('path/to/your/file.txt', 'r') as f:
    ip = numpy.array([line.strip() for line in f], dtype=numpy.int64)

find_numpy(1000000001, ip)
# 99686
%timeit find_numpy(1000000001, ip)
# 10 loops, best of 3: 86.6 ms per loop
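
Because the list in the question is already sorted, the boolean mask (which still compares every element) could in principle be replaced by a binary search on the array. The sketch below, written for Python 3, uses numpy.searchsorted for that. It is an alternative and not part of the original answer. The name find_sorted and the sample values are illustrative, and it assumes the array is sorted in ascending order.

```python
import numpy as np

def find_sorted(look_for, sorted_arr):
    # Same bounds as find_numpy: right-pad the prefix with zeros to 15 digits.
    search_min = int('{0:0<15s}'.format(str(look_for)))
    search_max = int('{0:0<15s}'.format(str(look_for + 1)))
    # side='left' gives the first index whose value is >= the bound,
    # so hi - lo counts the entries in [search_min, search_max).
    lo = np.searchsorted(sorted_arr, search_min, side='left')
    hi = np.searchsorted(sorted_arr, search_max, side='left')
    return int(hi - lo)

# Tiny made-up sample, sorted ascending.
arr = np.array([
    100000000000000,
    100000000000001,
    100000000999999,
    100000001000000,
    100000001500000,
    100000010000000,
], dtype=np.int64)

print(find_sorted(100000000, arr))
print(find_sorted(100000001, arr))
```

On unsorted data this gives wrong counts, so the mask-based find_numpy remains the safer choice when the order is not guaranteed.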

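【Discussion】:

Can you explain your code? You are running sum over the elements smaller than 100000000010000 (according to the manual, "sum of array elements over a given axis")? Wouldn't 100000000010001 then give the wrong result? I always seem to get a result of 0...
Just made an edit to make it clearer in terms of your other search function.
Essentially, we read your list into a numpy array with dtype=numpy.int64, which converts all of your strings to integers. Now we can basically follow the logic of your integer search, but using numpy's vectorization instead of a loop.
To read your file: numpy.array([line.strip() for line in open(file)], dtype=numpy.int64)
ip <= 100000000010000 produces an array of True/False values, and sum() treats True as 1 and False as 0.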
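
A small illustration of that last point, with made-up values (Python 3): the comparison yields one boolean per element, and summing the booleans counts how many are True. numpy.count_nonzero is shown as an equivalent way to count them.

```python
import numpy as np

# Four made-up entries; two are <= the threshold, two are above it.
ip = np.array([
    100000000000005,
    100000000010000,
    100000000010001,
    100000000500000,
], dtype=np.int64)

mask = ip <= 100000000010000
print(mask)                    # one True/False flag per element
print(mask.sum())              # True counts as 1, False as 0
print(np.count_nonzero(mask))  # same count, without the summation
```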