What's the fastest way of finding a random index in a Python list, a large number of times?

【Question title】: What's the fastest way of finding a random index in a Python list, a large number of times?
【Posted】: 2021-03-23 04:01:44
【Question description】:

What's the best (fastest) way to extract a random value from a list a large number (>1M) of times?

I currently have a graph represented as an adjacency list, whose inner lists can differ enormously in length (in the range [2, possibly 100k]).

I need to iterate over this list to generate random walks, so my current solution is:

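1. Get a random node
2. Pick a random index from that node's adjacency list
3. Move to the new node
4. Go to 2
5. Repeat until the random walk is as long as needed
6. Go to 1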
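
The loop described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the names `random_walks`, `adjacency`, `num_walks`, and `walk_length` are assumptions made here, and node IDs are assumed to be integer indices into the adjacency list.

```python
import random

def random_walks(adjacency, num_walks, walk_length, seed=None):
    """Generate num_walks random walks of walk_length steps each.

    adjacency[i] is assumed to be the list of neighbor IDs of node i,
    and every node is assumed to have at least one neighbor.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        node = rng.randrange(len(adjacency))   # pick a random start node
        walk = [node]
        for _ in range(walk_length):
            neighbors = adjacency[node]
            # pick a random index into the neighbor list, then move there
            node = neighbors[int(rng.random() * len(neighbors))]
            walk.append(node)
        walks.append(walk)                     # then start over with a new walk
    return walks

# Tiny hypothetical graph: a triangle 0-1-2 with node 3 attached to node 2.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2]]
walks = random_walks(adjacency, num_walks=3, walk_length=5, seed=0)
```

Each walk holds the start node plus one entry per step, so a walk of 5 steps has 6 entries.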

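This works well enough when the graph is not too big, but I am now working with a graph of >440k nodes, with a very large variance in the number of edges per node.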

The expression I currently use to pick the random index is:

node_neighbors[int(random.random() * number_neighbors_of_node)]

This has sped up the computation compared with my previous implementation, but it is still unacceptably slow for my purposes.

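A node can have anywhere from 2 to tens of thousands of neighbors. I cannot remove the small nodes, and I have to generate tens of thousands of random walks in this setting.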

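Profiling the code shows that most of the generation time is spent finding these indices, so I am looking for a way to reduce that time. If the algorithm can be changed to sidestep the lookup entirely, that would be great as well.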

Thanks!

EDIT: Out of curiosity, I tested four variants of the same code using timeit, with the following results:

setup='''
import numpy as np
import random

# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000

l = [list(range(random.randint(2, 10000))) for _ in range(10000)]
'''

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)] # Get a random node adj list 
    vv = v[random.randint(0, len(v)-1)] # Find a random neighbor in v

0.29709450000001425

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[np.random.choice(v)]

26.760767499999986

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[int(random.random()*(len(v)))]

0.19086300000000733

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[int(random.choice(v))]

0.24351880000000392

【Comments】:

random.choice...?
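As @deceze mentioned, you can use random.choice, or alternatively numpy.random.choice.
@deceze choice is very slow compared to random.
How fast do you need it to be?
@Pawan Is that numpy one fast for you? I tested it with array([0] * 1000) and it was about 20 times slower than the OP's on [0] * 1000.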

Tags: python list performance random

【Solution 1】:

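Your solution (sol3) is already the fastest, and by a larger margin than your tests suggest. I adjusted the performance measurements to eliminate the arbitrary selection of a node in favor of a path traversal, which is closer to your stated objective.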

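Here are the improved performance tests and results. I added sol5() to see whether pre-computing a list of random values would make a difference (I was hoping numpy would vectorize it, but it did not get any faster).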

Setup

import numpy as np
import random

# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000

nodes     = [list(range(random.randint(2, 10000))) for _ in range(10000)]
pathLen   = 1000

Solutions

def sol1():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[random.randint(0, len(node)-1)] # move to a random neighbor

def sol2():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[np.random.choice(node)]

def sol3():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[int(random.random()*(len(node)))]

def sol4():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[int(random.choice(node))]

def sol5():
    node = nodes[0]
    for rng in np.random.random_sample(pathLen):
        node = nodes[int(rng*len(node))]

Measurement

from timeit import timeit
count = 100

print("sol1",timeit(sol1,number=count))
print("sol2",timeit(sol2,number=count))
print("sol3",timeit(sol3,number=count))
print("sol4",timeit(sol4,number=count))
print("sol5",timeit(sol5,number=count))

sol1 0.12516996199999975
sol2 30.445685411
sol3 0.03886452900000137
sol4 0.1244026900000037
sol5 0.05330073100000021

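numpy is not very good at handling matrices with variable dimensions (such as your neighbor lists), but one way to speed up the process may be to vectorize the selection of the next node. By assigning a random float to every node in a numpy array, you can use it to navigate between nodes until your path comes back to an already visited node. Only then do you need to generate a new random value for that node. Presumably, depending on the path length, the number of these "collisions" will be relatively small.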
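
A minimal sketch of that idea for a single walk, under stated assumptions: the function name `walk_with_cached_randoms`, the `visited` bookkeeping, and the use of `np.random.default_rng` are choices made here for illustration and do not appear in the answer. It was not benchmarked, so it shows the mechanism only and makes no claim about speed.

```python
import numpy as np

def walk_with_cached_randoms(adjacency, start, length, rng):
    """Random walk that reuses one pre-drawn float per node.

    A node's float is only redrawn when the walk comes back to that node,
    so steps onto fresh nodes need no call into the random generator.
    """
    draws = rng.random(len(adjacency))         # one float in [0, 1) per node
    visited = np.zeros(len(adjacency), dtype=bool)
    node = start
    path = [node]
    for _ in range(length):
        if visited[node]:                      # "collision": refresh this node's float
            draws[node] = rng.random()
        visited[node] = True
        neighbors = adjacency[node]
        node = neighbors[int(draws[node] * len(neighbors))]
        path.append(node)
    return path

# Tiny hypothetical graph; adjacency[i] lists the neighbors of node i.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2]]
path = walk_with_cached_randoms(adjacency, start=0, length=10,
                                rng=np.random.default_rng(0))
```

On a graph this small almost every step is a collision; the approach only pays off when revisits are rare relative to the path length.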

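Using the same idea, and leveraging numpy's vectorization, you can run multiple traversals in parallel by building a matrix of node identifiers (columns) in which each row is one parallel traversal.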

To illustrate this, here is a function that advances multiple "ants" through the nodes along their respective random paths:

import numpy as np
import random

nodes   = [list(range(random.randint(2, 10000))) for _ in range(10000)]
nbLinks = np.array(list(map(len,nodes)),dtype=int)            # number of neighbors per node (np.int was removed in NumPy 1.24)
npNodes = np.array([nb+[-1]*(10000-len(nb)) for nb in nodes]) # fixed sized rows for numpy, padded with -1

def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
    if antPos is None:
        antPos = np.random.choice(len(nodes),antCount)  # random starting nodes
    paths = antPos[:,None]

    for _ in range(stepCount):
        # one random neighbor index per ant, scaled by that ant's neighbor count
        nextIndex = np.random.random_sample(size=(antCount,))*nbLinks[antPos]
        antPos    = npNodes[antPos,nextIndex.astype(int)]
        if allPaths:
            paths = np.append(paths,antPos[:,None],axis=1)
        
    return paths if allPaths else antPos

Example: 12 ants advancing randomly for 8 steps from random starting positions

print(moveAnts(12,8,allPaths=True))

"""
    [[8840 1302 3438 4159 2983 2269 1284 5031 1760]
     [4390 5710 4981 3251 3235 2533 2771 6294 2940]
     [3610 2059 1118 4630 2333  552 1375 4656 6212]
     [9238 1295 7053  542 6914 2348 2481  718  949]
     [5308 2826 2622   17   78  976   13 1640  561]
     [5763 6079 1867 7748 7098 4884 2061  432 1827]
     [3196 3057   27  440 6545 3629  243 6319  427]
     [7694 1260 1621  956 1491 2258  676 3902  582]
     [1590 4720  772 1366 2112 3498 1279 5474 3474]
     [2587  872  333 1984 7263  168 3782  823    9]
     [8525  193  449  982 4521  449 3811 2891 3353]
     [6824 9221  964  389 4454  720 1898  806   58]]
"""

Performance is not great for a single ant, but the time per ant is much better when many ants run at the same time:

from timeit import timeit
count = 100

antCount  = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t) # 0.9538277329999989 / 100 --> 0.009538277329999989 per ant

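[EDIT] I thought of a better array model for variable-size rows and came up with an approach that does not waste memory on a (mostly empty) matrix of fixed dimensions. The approach is to use a 1D array that holds the links of all nodes consecutively, plus two additional arrays that hold each node's start position and number of neighbors. This data structure turns out to run even faster than the fixed-size 2D matrix.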

import numpy as np
import random

nodes     = [list(range(random.randint(2, 10000))) for _ in range(10000)]
links     = np.array(list(n for neighbors in nodes for n in neighbors))
linkCount = np.array(list(map(len,nodes)),dtype=int)    # number of neighbors for each node (np.int was removed in NumPy 1.24)
firstLink = np.insert(np.add.accumulate(linkCount),0,0) # index of first link for each node

def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
    if antPos is None:
        antPos = np.random.choice(len(nodes),antCount)  # random starting nodes
    paths = antPos[:,None]

    for _ in range(stepCount):
        # offset of the chosen neighbor within each ant's slice of links
        nextIndex = np.random.random_sample(size=(antCount,))*linkCount[antPos]
        antPos    = links[firstLink[antPos]+nextIndex.astype(int)]
        if allPaths:
            paths = np.append(paths,antPos[:,None],axis=1)
        
    return paths if allPaths else antPos

from timeit import timeit
count = 100

antCount  = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t) # 0.7157810379999994 / 100 --> 0.007157810379999994 per ant

The "per-ant" performance improves as you add more ants, up to a point (roughly 10x faster than sol3):

antCount  = 1000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t,t/antCount) #3.9749405650000007, 0.0039749405650000005 per ant

antCount  = 10000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t,t/antCount) #32.688697579, 0.0032688697579 per ant
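
To make the flat layout from the edit above concrete, here is a small worked example on a hypothetical 4-node graph. The variable names `links`, `linkCount`, and `firstLink` follow the answer; the graph itself, the helper `neighbors_of`, and the use of `np.random.default_rng` are assumptions added for illustration. The layout is essentially the compressed sparse row (CSR) scheme used for sparse matrices.

```python
import numpy as np

# Hypothetical 4-node graph: adjacency[i] lists the neighbors of node i.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2]]

links     = np.array([n for neighbors in adjacency for n in neighbors])
linkCount = np.array([len(neighbors) for neighbors in adjacency])
firstLink = np.insert(np.cumsum(linkCount), 0, 0)

# links     -> [1 2 0 2 0 1 3 2]   all neighbor lists, back to back
# linkCount -> [2 2 3 1]           neighbors per node
# firstLink -> [0 2 4 7 8]         where each node's slice of links starts

def neighbors_of(node):
    """Slice of `links` holding the neighbors of `node`."""
    start = firstLink[node]
    return links[start:start + linkCount[node]]

# One vectorized step for three walkers standing on nodes 0, 2 and 3.
rng = np.random.default_rng(0)
antPos  = np.array([0, 2, 3])
offset  = (rng.random(antPos.size) * linkCount[antPos]).astype(np.int64)
nextPos = links[firstLink[antPos] + offset]
```

Because every lookup is an index into a 1D array, a step for all walkers costs a handful of numpy calls regardless of how uneven the neighbor counts are, and memory use is proportional to the number of edges rather than to nodes times the maximum degree.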

【Discussion】:

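Thanks for the answer. I had already suspected that numpy would not perform well with arrays this variable in size, and I was wondering whether there was a way to sidestep the problem. I am not sure I can implement your solution in my code right away, but thanks again for finding a technique that may be faster than what I already have.
See my edit for a more efficient way to manage variable-size matrices with numpy (and better startup performance as well).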