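If I understand correctly, you want a sequence of tuples of numbers, drawn from given ranges, with no repeats.
EDIT 0:
I believe your best bet is to first create all possible combinations and then shuffle them: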
import itertools
import random


def random_unique_combinations_k0(items, k):
    # generate all possible combinations
    combinations = list(itertools.product(*items))
    # shuffle them
    random.shuffle(combinations)
    # yield at most `k` of them
    for combination in itertools.islice(combinations, k):
        yield combination
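EDIT 1:
If generating all combinations is too expensive in terms of memory, you may have to go by trial and error, rejecting non-unique combinations.
One way of doing this is: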
import itertools
import random
import functools


def prod(items):
    # start from 1 so that an empty input does not raise
    return functools.reduce(lambda x, y: x * y, items, 1)


def random_unique_combinations_k1(items, k):
    max_lens = [len(list(item)) for item in items]
    max_num_combinations = prod(max_lens)
    # use `set` to ensure uniqueness
    index_combinations = set()
    # cap the target at the total number of combinations so the loop can exit
    # WARNING: if `k` is too close to the total number of combinations,
    # it may take a while until the next valid combination is found
    while len(index_combinations) < min(k, max_num_combinations):
        index_combinations.add(tuple(
            random.randint(0, max_len - 1) for max_len in max_lens))
    # make sure their order is shuffled
    # (`set` iteration order is arbitrary, not random)
    index_combinations = list(index_combinations)
    random.shuffle(index_combinations)
    for index_combination in itertools.islice(index_combinations, k):
        yield tuple(item[i] for i, item in zip(index_combination, items))
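(This could also be done with lists only, checking for uniqueness before adding each combination, which also makes random.shuffle() superfluous, but in my tests that was slower than using sets.)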
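For comparison, here is a minimal sketch of what that list-only variant might look like. The function name `random_unique_combinations_list` is made up for illustration and this is not the code that was timed; it only shows why the membership check gets expensive.

```python
import random


def random_unique_combinations_list(items, k):
    # hypothetical list-only variant of method 1 (illustration only)
    items = [list(item) for item in items]
    max_lens = [len(item) for item in items]
    max_num_combinations = 1
    for max_len in max_lens:
        max_num_combinations *= max_len
    index_combinations = []
    while len(index_combinations) < min(k, max_num_combinations):
        candidate = tuple(random.randrange(max_len) for max_len in max_lens)
        # linear scan of the list; a `set` does this lookup in roughly O(1)
        if candidate not in index_combinations:
            index_combinations.append(candidate)
    # candidates were appended in the order they were drawn,
    # so no extra shuffle is needed
    for index_combination in index_combinations:
        yield tuple(item[i] for i, item in zip(index_combination, items))


print(list(random_unique_combinations_list([range(3), "ab", [True, False]], 5)))
```

Each `not in` check scans the whole list, so the cost grows roughly quadratically with `k`, which is consistent with the set-based version being faster.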
EDIT 2:
Probably the least memory-hungry approach is to actually shuffle the input sequences and then use itertools.product() on them.
import random
import itertools


def pseudo_random_unique_combinations_k(items, k):
    # randomize each input: shuffle a materialized copy and keep it
    # (shuffling a temporary `list(...)` would leave the input unchanged)
    comb_gens = [list(item) for item in items]
    for comb_gen in comb_gens:
        random.shuffle(comb_gen)
    # get the first `k` combinations
    combinations = list(itertools.islice(itertools.product(*comb_gens), k))
    random.shuffle(combinations)
    for combination in combinations:
        yield tuple(combination)
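This obviously sacrifices some randomness: itertools.product() varies the last input fastest, so the first k combinations all share the same few leading elements.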
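To make the loss of randomness concrete, the short check below counts how many distinct values show up in each position among the first 100 tuples of a product. The sizes (three inputs of 50 values, 100 tuples) are arbitrary choices for illustration.

```python
import itertools

items = [range(50)] * 3
first = list(itertools.islice(itertools.product(*items), 100))

# `itertools.product` advances the last iterable fastest, so the first
# 100 tuples are (0, 0, 0..49) followed by (0, 1, 0..49)
distinct_per_position = [len(set(column)) for column in zip(*first)]
print(distinct_per_position)  # [1, 2, 50]
```

Shuffling the inputs beforehand changes which values appear in the leading positions, but not how few of them there are.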
EDIT 3:
Following @Divakar's approach, I wrote another version that seems relatively efficient, but it is likely limited by what random.sample() can handle.
import random
import functools


def prod(items):
    # start from 1 so that an empty input does not raise
    return functools.reduce(lambda x, y: x * y, items, 1)


def random_unique_combinations_k3(items, k):
    max_lens = [len(list(item)) for item in items]
    max_num_combinations = prod(max_lens)
    # each flat index in `range(max_num_combinations)` encodes one combination
    for i in random.sample(range(max_num_combinations), k):
        # decode the flat index into one index per input (mixed radix)
        index_combination = []
        for max_len in max_lens:
            index_combination.append(i % max_len)
            i = i // max_len
        yield tuple(item[j] for j, item in zip(index_combination, items))
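One possible way around the random.sample() limitation, sketched here as an assumption rather than something that was benchmarked, is to draw the flat indices one at a time with random.randrange(), which accepts arbitrarily large Python integers, and to reject repeats with a set as in method 1. The name `random_unique_combinations_bigint` is hypothetical.

```python
import random


def random_unique_combinations_bigint(items, k):
    # hypothetical mix of methods 1 and 3 (illustration only)
    items = [list(item) for item in items]
    max_lens = [len(item) for item in items]
    max_num_combinations = 1
    for max_len in max_lens:
        max_num_combinations *= max_len
    flat_indices = set()
    while len(flat_indices) < min(k, max_num_combinations):
        # `random.randrange` works on big integers, unlike sampling from
        # a `range` whose length does not fit in a platform-sized integer
        flat_indices.add(random.randrange(max_num_combinations))
    flat_indices = list(flat_indices)
    random.shuffle(flat_indices)
    for flat_index in flat_indices:
        # decode the flat index into one index per input (mixed radix)
        index_combination = []
        for max_len in max_lens:
            flat_index, digit = divmod(flat_index, max_len)
            index_combination.append(digit)
        yield tuple(item[i] for i, item in zip(index_combination, items))


sample = list(random_unique_combinations_bigint([range(50)] * 50, 1000))
print(len(sample), len(set(sample)))
```

Like method 1, this slows down when `k` approaches the total number of combinations, because most draws are then rejected as repeats.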
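Tests
On the requested input, they all run reasonably fast: in the timings below the 2 (pseudo) method is the fastest, method 1 is the slowest, and methods 0 and 3 sit in between.
The sklearn.model_selection.ParameterSampler approach is comparable in speed to method 1.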
items = [v for k, v in hyperparams.items()]
num = 100
%timeit list(random_unique_combinations_k0(items, num))
615 µs ± 4.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(random_unique_combinations_k1(items, num))
2.51 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
179 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(random_unique_combinations_k3(items, num))
570 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# the `sklearn` method, which is slightly different in that it is
# also accessing the underlying dictionary
from sklearn.model_selection import ParameterSampler
%timeit list(ParameterSampler(hyperparams, n_iter=num))
2.86 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As a side note, I would make sure your hyperparams is a collections.OrderedDict, because a plain dict only guarantees insertion order from Python 3.7 onward.
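A minimal sketch of that side note follows. The `hyperparams` contents are placeholder names and values, not the ones from the question, and the sampling uses random.sample() on the full product, which is the same idea as method 0 rather than its exact code.

```python
import collections
import itertools
import random

# hypothetical search space; the names and values are placeholders
hyperparams = collections.OrderedDict([
    ("learning_rate", [0.01, 0.1, 1.0]),
    ("max_depth", [2, 4, 8]),
    ("subsample", [0.5, 1.0]),
])

# keys and values come out in the same, well-defined order
names = list(hyperparams.keys())
items = list(hyperparams.values())

# enumerate every combination, then sample 5 of them without repeats
all_combinations = list(itertools.product(*items))
sampled = random.sample(all_combinations, 5)

# re-attach the parameter names in the same order as `items`
sampled_params = [dict(zip(names, combination)) for combination in sampled]
for params in sampled_params:
    print(params)
```

Keeping `names` next to `items` is what lets each sampled tuple be mapped back to named parameters without guessing which position belongs to which key.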
For slightly larger inputs, we start to see the limits:
items = [range(50)] * 5
num = 1000
%timeit list(random_unique_combinations_k0(items, num))
# Memory Error
%timeit list(random_unique_combinations_k1(items, num))
19.3 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
1.82 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(random_unique_combinations_k3(items, num))
2.31 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even more so for larger inputs:
items = [range(50)] * 50
num = 1000
%timeit list(random_unique_combinations_k0(items, num))
# Memory Error
%timeit list(random_unique_combinations_k1(items, num))
149 ms ± 3.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
4.92 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(random_unique_combinations_k3(items, num))
# OverflowError
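Summary:
Method 0 may not fit in memory. Method 1 is the slowest, but it is probably the most robust. Method 3 gives the best performance as long as you do not run into the overflow problem. Method 2 (pseudo) is the fastest and uses little memory, but it produces combinations that are somewhat "less random".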