Generate N "random" strings of length K using a probability table

【Question title】: Generate N "random" strings of length K using a probability table
【Posted】: 2015-01-16 09:27:45
【Question】:

How can I create N "random" strings of length K using a probability table? K is an even number.

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

Assuming K = 6, 'acacab' would be more likely than 'aaaaaa'.
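The comparison above can be checked with a little arithmetic. This is a minimal sketch that assumes each two-character pair is drawn independently from the table; `string_probability` is a hypothetical helper name, and `math.prod` needs Python 3.8 or later.

```python
from math import prod

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

def string_probability(s, table, pair_len=2):
    """Probability of s, assuming each pair is an independent draw from table."""
    # Split the string into consecutive, non-overlapping pairs.
    pairs = [s[i:i + pair_len] for i in range(0, len(s), pair_len)]
    # Independent draws multiply.
    return prod(table[p] for p in pairs)

p_acacab = string_probability('acacab', prob_table)  # 0.5 * 0.5 * 0.3
p_aaaaaa = string_probability('aaaaaa', prob_table)  # 0.2 * 0.2 * 0.2
print(p_acacab > p_aaaaaa)
```

Under that independence assumption, 'acacab' comes out at about 0.075 and 'aaaaaa' at about 0.008.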

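This is a sub-problem of a larger problem in which I generate synthetic sequences from a probability table. I'm not sure how to use the probability table to generate the "random" strings.

What I have so far: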

def seq_prob(fprob_table,K= 6, N= 10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    #possibly using itertools or random to generate the semi-"random" strings based on the probabilities 
    return seq_list

【Question comments】:

This is a good question; random model sequences are really useful!

Tags: python string random probability itertools

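【Answer 1】:

The end of the documentation for the builtin random module describes some good ways to make a weighted random choice:

A common task is to make a random.choice() with weighted probabilities.

If the weights are small integer ratios, a simple technique is to build a sample population with repeats: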

>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'

A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():

>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'

To adapt the latter approach to your specific problem, I'd do something like this:

import random
import itertools
import bisect

def seq_prob(fprob_table, K=6, N=10):
    # Transpose the (key, probability) pairs into two parallel tuples.
    choices, weights = zip(*fprob_table.items())
    cumdist = list(itertools.accumulate(weights))

    results = []
    for _ in range(N):
        s = ""
        while len(s) < K:
            x = random.random() * cumdist[-1]
            s += choices[bisect.bisect(cumdist, x)]
        results.append(s)

    return results

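This assumes that the key strings in your probability table are all the same length. If they have several different lengths, this code will sometimes (perhaps most of the time!) give answers that are longer than K characters. I suppose it also assumes that K is an exact multiple of the key length, though it will still run if that's not true (it will just give result strings that are all longer than K characters, since there is no way to get K exactly).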
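On Python 3.6 and later, the standard library's random.choices() accepts weights directly, so the cumulative-distribution bookkeeping can be left to it. The following is only a sketch of that alternative, not the code from the answer above; `seq_prob_choices` is a hypothetical name, and it assumes all keys have the same length and that K is a multiple of that length.

```python
import random

def seq_prob_choices(fprob_table, K=6, N=10):
    """Sketch: N strings of length K built from weighted draws of the keys."""
    keys = list(fprob_table)
    weights = [fprob_table[key] for key in keys]
    key_len = len(keys[0])            # assumption: every key has this length
    picks_per_string = K // key_len   # assumption: K is a multiple of key_len
    return [
        "".join(random.choices(keys, weights=weights, k=picks_per_string))
        for _ in range(N)
    ]

print(seq_prob_choices({'aa': 0.2, 'ab': 0.3, 'ac': 0.5}))
```

The weights do not need to sum to 1 here, because random.choices() treats them as relative weights.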

【Comments】:

Note: itertools.accumulate() is new in Python 3.2.

【Answer 2】:

You can use random.random:

from random import random
def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    s = ""
    while len(seq_list) < N:
        for k, v in fprob_table.items():
            if len(s) == K:
                seq_list.append(s)
                s = ""
                break
            rn = random()
            if rn <=  v:
                s += k
    return seq_list

This can no doubt be improved, but random.random is useful when dealing with probabilities.
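One possible improvement, sketched here under my own assumptions rather than taken from the answer: the loop above makes a separate random() draw for every key, so the dictionary order affects which strings come out. A variant that makes a single draw per pick and walks a running total would look roughly like this (`weighted_pick` and `seq_prob_single_draw` are hypothetical names):

```python
from random import random

def weighted_pick(fprob_table):
    """Return one key, chosen with probability proportional to its value."""
    total = sum(fprob_table.values())
    threshold = random() * total      # one draw per pick
    running = 0.0
    for key, prob in fprob_table.items():
        running += prob
        if threshold < running:
            return key
    return key  # fallback for floating-point round-off on the last key

def seq_prob_single_draw(fprob_table, K=6, N=10):
    seq_list = []
    for _ in range(N):
        s = ""
        while len(s) < K:             # assumes K is a multiple of the key length
            s += weighted_pick(fprob_table)
        seq_list.append(s)
    return seq_list

print(seq_prob_single_draw({'aa': 0.2, 'ab': 0.3, 'ac': 0.5}))
```

Because the threshold is compared against a running sum, the keys do not have to be sorted by probability for this to work.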

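【Comments】:

I like this better than building a list the way I did. However, I think you need to make sure the probabilities are sorted. Something like this should work: ordered_probs = sorted((prob, char_pair) for char_pair, prob in fprob_table.items()).

【Answer 3】:

I'm sure there is a cleaner/better way, but here is a simple way to do it.

Here we fill pick_list with 100 individual character-pair values, where the number of copies of each value is determined by its probability. In this case pick_list holds 20 'aa', 30 'ab' and 50 'ac' entries. random.choice(pick_list) then pulls an entry from the list uniformly at random.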

import random

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}


def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input

    # fill list with number of items based on the probabilities
    pick_list = []
    for key, prob in fprob_table.items():
        # round() rather than int(): prob * 100 can land just below a whole number
        pick_list.extend([key] * round(prob * 100))

    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    for i in range(N):
        sub_seq = "".join(random.choice(pick_list) for _ in range(K // 2))
        seq_list.append(sub_seq)
    return seq_list

With the result:

 seq_prob(prob_table)
['ababac',
 'aaacab',
 'aaaaac',
 'acacac',
 'abacac',
 'acaaac',
 'abaaab',
 'abaaab',
 'aaabaa',
 'aaabaa']
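
A quick way to sanity-check output like this is to count how often each pair appears and compare the shares with the table. The sketch below reuses the ten strings printed above as its input; `pair_counts` is a hypothetical helper, and with only 30 pairs the observed shares can sit far from 0.2 / 0.3 / 0.5, so a larger N is needed before reading much into them.

```python
from collections import Counter

# The ten strings from the sample run above.
sample = ['ababac', 'aaacab', 'aaaaac', 'acacac', 'abacac',
          'acaaac', 'abaaab', 'abaaab', 'aaabaa', 'aaabaa']

def pair_counts(seqs, pair_len=2):
    """Count the non-overlapping pairs across all sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s[i:i + pair_len] for i in range(0, len(s), pair_len))
    return counts

counts = pair_counts(sample)
total = sum(counts.values())
for pair, n in sorted(counts.items()):
    print(pair, n, round(n / total, 3))
```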

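【Comments】:

【Answer 4】:

If your table or sequences are large, using numpy may help, since it will likely be significantly faster. Also, numpy is built for this kind of problem, and the approach is easy to understand and takes just 3 or 4 lines of code.

The idea is to convert the probabilities into cumulative probabilities, i.e. to map (.2, .5, .3) to (.2, .7, 1.). Random numbers drawn from a uniform distribution on 0 to 1 will then fall into the bins of the cumulative sum with frequencies that correspond to the weights. Numpy's searchsorted can be used to quickly find the bin each random value falls in. That is,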

import numpy as np

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
N = 10
k = 3   # number of key strings per sequence (not number of characters)

# On Python 3, keys() and values() are views, so convert them to lists first
keys = np.array(list(prob_table.keys()))
cum_probs = np.cumsum(list(prob_table.values()))   # cumulative probabilities

rvals = np.random.random((N, k))                   # generate a bunch of random values
string_indices = np.searchsorted(cum_probs, rvals) # weighted indices
x = keys[string_indices]                           # get the strings associated with the indices
y = ["".join(row) for row in x]                    # convert this to a list of strings

# y = ['acabab', 'acacab', 'acabac', 'aaacaa', 'acabac', 'acacab', 'acabaa', 'aaabab', 'abacac', 'aaabab']

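Here I use k as the number of strings you need, rather than K as the number of characters, since the problem statement is ambiguous about strings versus characters.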
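
For comparison, newer numpy versions (1.17 and later) also provide a Generator.choice method that takes the probabilities directly through its p argument. This is a sketch of that alternative, not part of the approach above; it assumes the probabilities sum to 1, and k again counts key strings per sequence.

```python
import numpy as np

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
N = 10
k = 3   # key strings per sequence, so each result has 2 * k characters here

rng = np.random.default_rng()
keys = np.array(list(prob_table.keys()))
probs = np.array(list(prob_table.values()))    # assumption: sums to 1

picks = rng.choice(keys, size=(N, k), p=probs) # (N, k) array of weighted picks
y = ["".join(row) for row in picks]            # join each row into one string
print(y)
```

Passing a seed to default_rng() makes the output reproducible, which can be handy when testing.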

【Comments】: