Weighted random sample of array items *without replacement*

【Question title】: Weighted random sample of array items *without replacement*
【Posted】: 2021-03-10 17:26:17
【Question】:

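A JavaScript/ECMAScript 6 specific solution is desired.

I want to generate a random sample from an array of objects, using an array of weights with one value per object. The population list contains the actual members of the population, not types of members. Once a member has been selected for the sample, it cannot be selected again.

A problem similar to the one I'm working on would be simulating the likely results of a chess tournament. Each player's rating would be their weight. Each player can place only once (1st, 2nd, or 3rd) per tournament.

Picking a likely list of the top 3 winners might look like this: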

let winners = wsample(chessPlayers,  // population
                      playerRatings, // weights
                      3);            // sample size

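The weight list may or may not contain integer values. It could hold floats like [0.2, 0.1, 0.7, 0.3] or integers like [20, 10, 70, 30]. The weights do not have to add up to a value that represents 100%.

Peter (in the comments below) gave me a good reference on a general algorithm, but it is not specific to JS: https://stackoverflow.com/a/62459274/7915759 It may be a good point of reference.

Solutions that rely on generating a second population list, with each member copied as many times as its weight, may not be practical. Each weight in the weights array could be a very high number or a fraction; basically, any non-negative value.

Some additional questions:

Is there already an accumulate() function available in JS?
Is there a bisect() type function in JS that does a binary search of sorted lists?
Are there any efficient, low-memory-footprint JS modules with statistical functions that include solutions for the above?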

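【Discussion】:

Does this answer your question? What would be the fastest algorithm to randomly select N items from a list based on weights distribution?
I'll take another look, but from a quick scan I don't think the answer you linked is JS / ECMAScript 6 (?)
That question applies to any programming language, not just JavaScript.
I see. I believe the algorithm discussed in the other answer is very similar to the one above (I've looked through some of it). Still, it leaves me with some unanswered JS-specific questions. Also, I think it's best to have a JS-specific Q&A with some code that other JS developers can easily pick up in the future.
What exactly is your question? If this is working code and you just want comments on how to improve it, you could submit it to codereview.stackexchange.com. But only if it already works.

Tags: javascript random ecmascript-6

【Solution 1】: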

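Here is one approach, though it is not the most efficient.

Efficiency can be improved by using a binary indexed tree for the prefix sums.

First, the top-level function. It iterates k times, calling wchoice() each time. To remove the currently selected member from the population, it updates the accumulated weights as if that member's weight were 0.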

/**
 * Produces a weighted sample from `population` of size `k` without replacement.
 * 
 * @param {Object[]} population The population to select from.
 * @param {number[]} weights    The weighted values of the population.
 * @param {number}   k          The size of the sample to return.
 * @returns {[number[], Object[]]} An array of two arrays. The first holds the
 *                                 indices of the members in the sample, and
 *                                 the second holds the sample members.
 */
function wsample(population, weights, k) {
    let sample  = [];
    let indices = [];
    let index   = 0;
    let choice  = null;
    let acmwts  = accumulate(weights);

    for (let i=0; i < k; i++) {
        [index, choice] = wchoice(population, acmwts, true);
        sample.push(choice);
        indices.push(index);

        // The below updates the accumulated weights as if the member
        // at `index` has a weight of 0, eliminating it from future draws.
        // This portion could be optimized. See note below.
        let ndecr = weights[index];
        for (; index < acmwts.length; index++) {
            acmwts[index] -= ndecr;
        }
    }
    return [indices, sample];
}

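The part of the code above that updates the accumulated weights array is where the algorithm is inefficient. In the worst case each draw updates O(n) entries, so drawing k members costs O(kn) overall. Another solution here follows a similar algorithm, but uses a binary indexed tree to reduce the cost of updating the prefix sums to an O(log n) operation.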
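
As a rough illustration of the binary indexed tree idea mentioned above, here is a minimal sketch. The names `FenwickTree` and `wsampleFenwick` are hypothetical and not taken from the linked answers. It assumes non-negative weights, and it does not guard against floating-point residue that repeated subtraction may leave behind when the weights are not integers.

```javascript
// Sketch only: FenwickTree and wsampleFenwick are hypothetical names.
class FenwickTree {
    constructor(values) {
        this.n = values.length;
        this.tree = new Array(this.n + 1).fill(0); // 1-based internally
        // O(n) build: push each node's partial sum up to its parent.
        for (let i = 1; i <= this.n; i++) {
            this.tree[i] += values[i - 1];
            const parent = i + (i & -i);
            if (parent <= this.n) this.tree[parent] += this.tree[i];
        }
        // Largest power of two <= n, used as the first step in findIndex().
        this.top = 1;
        while (this.top * 2 <= this.n) this.top *= 2;
    }

    // Adds `delta` to the value at 0-based index `i` in O(log n).
    add(i, delta) {
        for (let j = i + 1; j <= this.n; j += j & -j) this.tree[j] += delta;
    }

    // Sum of the first `count` values in O(log n).
    prefixSum(count) {
        let sum = 0;
        for (let j = count; j > 0; j -= j & -j) sum += this.tree[j];
        return sum;
    }

    // 0-based index of the first value whose inclusive prefix sum
    // is strictly greater than `target`, found in O(log n).
    findIndex(target) {
        let pos = 0;
        for (let step = this.top; step > 0; step >>= 1) {
            const next = pos + step;
            if (next <= this.n && this.tree[next] <= target) {
                pos = next;
                target -= this.tree[next];
            }
        }
        return pos;
    }
}

// Weighted sample of up to `k` members without replacement.
function wsampleFenwick(population, weights, k) {
    const tree = new FenwickTree(weights);
    let total = weights.reduce((a, b) => a + b, 0);
    const sample = [];
    const count = Math.min(k, population.length);
    for (let i = 0; i < count && total > 0; i++) {
        const idx = tree.findIndex(Math.random() * total);
        sample.push(population[idx]);
        tree.add(idx, -weights[idx]); // weight becomes 0: cannot be drawn again
        total -= weights[idx];
    }
    return sample;
}
```

Unlike wsample(), this sketch returns only the sampled members, not their indices. A call such as `wsampleFenwick(chessPlayers, playerRatings, 3)` would mirror the usage in the question.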

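wsample() calls wchoice() to select one member from a weighted list. wchoice() generates an array of accumulated weights (unless it is given pre-accumulated ones), then generates a random number from 0 up to the sum of the weights (the last item in the accumulated weights list). It then finds that number's insertion point in the accumulated weights; the member at that index is the winner: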

/**
 * Randomly selects a member of `population` weighting the probability each 
 * will be selected using `weights`. `accumulated` indicates whether `weights` 
 * is pre-accumulated, in which case it will skip its accumulation step.
 * 
 * @param {Object[]} population    The population to select from.
 * @param {number[]} weights       The weights of the population.
 * @param {boolean}  [accumulated] true if weights are pre-accumulated.
 *                                 Treated as false if not provided.
 * @returns {[number, Object]} An array with the selected member's index and 
 *                             the member itself.
 */
function wchoice(population, weights, accumulated) {
    let acm = (accumulated) ? weights : accumulate(weights);
    let rnd = Math.random() * acm[acm.length - 1];

    let idx = bisect_left(acm, rnd);

    return [idx, population[idx]];
}

Here is my JS implementation of the binary search algorithm, adapted from https://en.wikipedia.org/wiki/Binary_search_algorithm:

/**
 * Finds the left insertion point for `target` in array `arr`. Uses a binary
 * search algorithm.
 * 
 * @param {number[]} arr    A sorted ascending array.
 * @param {number}   target The target value.
 * @returns {number} The index in `arr` where `target` can be inserted to
 *                   preserve the order of the array.
 */
function bisect_left(arr, target) {
    let n = arr.length;
    let l = 0;
    let r = n - 1;
    while (l <= r) {
        let m = Math.floor((l + r) / 2);
        if (arr[m] < target) {
            l = m + 1;
        } else if (arr[m] >= target) {
            r = m - 1;
        } 
    }
    return l;
}
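
One edge case may be worth noting. Math.random() can return exactly 0, and with a left insertion point a target of 0 lands on index 0 even when that member has a weight of 0 (or has already been removed). A right insertion point avoids this, because it returns the first index whose accumulated weight is strictly greater than the target. The following is a sketch of that variant; `bisectRight` is a hypothetical name, not part of the answer's code.

```javascript
// Sketch only: bisectRight is a hypothetical helper.
// Returns the first index in sorted `arr` whose value is > `target`.
function bisectRight(arr, target) {
    let lo = 0;
    let hi = arr.length;
    while (lo < hi) {
        const mid = (lo + hi) >>> 1;
        if (arr[mid] <= target) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Accumulated weights for [0, 3, 0, 2]: members 0 and 2 must never win.
const acm = [0, 3, 3, 5];
console.log(bisectRight(acm, 0));   // 1
console.log(bisectRight(acm, 2.5)); // 1
console.log(bisectRight(acm, 3));   // 3
```

Since the random target is always strictly less than the total, the returned index stays within bounds and always points at a member whose remaining weight is positive.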

I couldn't find a ready-made JS accumulator function, so I wrote a simple one myself.

/**
 * Generates an array of accumulated values for `numbers`.
 * e.g.: [1, 5, 2, 1, 5] --> [1, 6, 8, 9, 14]
 * 
 * @param {number[]} numbers The numbers to accumulate.
 * @returns {number[]} An array of accumulated values.
 */
function accumulate(numbers) {
    let accm  = [];
    let total = 0;
    for (let n of numbers) {
        total += n;
        accm.push(total);
    }
    return accm;
}
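
For comparison, the same running total can be written more compactly with Array.prototype.map. This is only an alternative sketch, and `accumulateMap` is a hypothetical name chosen to avoid clashing with accumulate() above.

```javascript
// Sketch only: accumulateMap is a hypothetical alternative to accumulate().
// Each output element is the sum of all inputs up to and including it.
const accumulateMap = (numbers) => {
    let total = 0;
    return numbers.map((n) => (total += n));
};

console.log(accumulateMap([1, 5, 2, 1, 5])); // [1, 6, 8, 9, 14]
```

Both versions run in O(n); the choice between them is mostly a matter of style.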

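【Discussion】:

One optimization for updating the weights: when you pick the element at position i, the only weights that need updating are those at positions i+1 through length-1. So instead of updating *all* weights each time an element is picked, update only those from i+1 onward, decrementing them by weights[i].
I've updated the solution to cut some of the accumulation overhead @kmoser
This appears to have O(kn) running time, where k is the number of items in the sample and n is the number of items to sample from, which is hardly optimal. Even without fancy statistics, an O(n + k log n) algorithm can be found. Of course, whether that is worth it depends on how large n and k are...
See the answer I just posted.

【Solution 2】:

The following implementation selects k out of n elements, without replacement and with weighted probabilities, in O(n + k log n) time by keeping the accumulated weights of the remaining elements in a sum heap: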

function sample_without_replacement<T>(population: T[], weights: number[], sampleSize: number) {

    let size = 1;
    while (size < weights.length) {
        size = size << 1;
    }

    // construct a sum heap for the weights
    const root = 1;
    const w = [...new Array(size) as number[], ...weights, 0];
    for (let index = size - 1; index >= 1; index--) {
        const leftChild = index << 1;
        const rightChild = leftChild + 1;
        w[index] = (w[leftChild] || 0) + (w[rightChild] || 0);
    }

    // retrieves an element with weight-index r 
    // from the part of the heap rooted at index
    const retrieve = (r: number, index: number): T => {
        if (index >= size) {
            w[index] = 0;
            return population[index - size];
        } 
        
        const leftChild = index << 1;
        const rightChild = leftChild + 1;

        try {
            if (r <= w[leftChild]) {
                return retrieve(r, leftChild);
            } else {
                return retrieve(r - w[leftChild], rightChild);
            }
        } finally {
            w[index] = w[leftChild] + w[rightChild];
        }
    }

    // and now retrieve sampleSize random elements without replacement
    const result: T[] = [];
    for (let k = 0; k < sampleSize; k++) {
        result.push(retrieve(Math.random() * w[root], root));
    }
    return result;
}

The code is written in TypeScript. You can transpile it to whatever version of ECMAScript you need in the TypeScript playground.

Test code:

const n = 1E7;
const k = n / 2;
const population: number[] = [];
const weight: number[] = [];
for (let i = 0; i < n; i++) {
    population[i] = i;
    weight[i] = i;
}

console.log(`sampling ${k} of ${n} elements without replacement`);
const sample = sample_without_replacement(population, weight, k);
console.log(sample.slice(0, 100)); // logging everything takes forever on some consoles
console.log("Done");

Executed in Chrome, this draws a sample of 5 000 000 out of 10 000 000 entries in about 10 seconds.

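【Discussion】:

What known algorithm did you model this on, and do you know whether it is statistically accurate or an approximation like reservoir sampling? If you have time, could you add some comments to the code to identify the different parts (heap insertion, pop, etc.)?
This is pretty much the same algorithm you used, except that I accumulate the weights in a sum heap, so I can update the partial sums in O(log n) rather than O(n). The statistical distribution should therefore be the same. I've added some comments and a link to a blog post explaining the technique.
Hmmm.. I don't see how you get around updating all of the subsequent accumulated weights after the current selection. I ran your code several times to profile it, and even after I modified my code to reduce the accumulated-weight updates, yours does far fewer updates than mine. I'm not sure how the heap reduces the overhead without missing some elements. Your code does seem to produce valid results, though =/
Then you should read the blog post I linked :-). The basic idea is that rather than computing the sum left to right like ((((((a+b)+c)+d)+e)+f)+g)+h, I compute it like ((a+b)+(c+d))+((e+f)+(g+h)). This means that when an input changes, only O(log n) intermediate results are affected, which lets me compute the new sum quickly by keeping track of the intermediate results. In addition, I can use these intermediate results to recompute any partial sum in O(log n) and to do the binary-search access, which brings the overall running time per retrieved element down to O(log n).
Yeah.. I was just looking at that. Great reference - thanks!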
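
The pairwise-sum idea from the discussion above can be illustrated with a small sketch. The helper names `buildSumTree` and `updateLeaf` are hypothetical and this is not the code from Solution 2; it only demonstrates that changing one input recomputes about log2(n) intermediate sums instead of n.

```javascript
// Sketch only: buildSumTree and updateLeaf are hypothetical helpers.
// Array-backed tree: leaves live at indices size..2*size-1, and node i
// holds the sum of its children 2*i and 2*i+1. Index 1 is the root.
function buildSumTree(values) {
    let size = 1;
    while (size < values.length) size *= 2;
    const tree = new Array(2 * size).fill(0);
    values.forEach((v, i) => { tree[size + i] = v; });
    for (let i = size - 1; i >= 1; i--) {
        tree[i] = tree[2 * i] + tree[2 * i + 1];
    }
    return { tree, size };
}

// Sets leaf `i` to `value` and recomputes only its ancestors.
// Returns how many intermediate sums were recomputed.
function updateLeaf({ tree, size }, i, value) {
    let node = size + i;
    tree[node] = value;
    let touched = 0;
    for (node >>= 1; node >= 1; node >>= 1) {
        tree[node] = tree[2 * node] + tree[2 * node + 1];
        touched++;
    }
    return touched;
}

// Eight inputs a..h, grouped as ((a+b)+(c+d))+((e+f)+(g+h)).
const st = buildSumTree([1, 2, 3, 4, 5, 6, 7, 8]);
console.log(st.tree[1]);              // 36
const touched = updateLeaf(st, 2, 0); // "remove" c by zeroing its weight
console.log(st.tree[1], touched);     // 33 3
```

With eight inputs, zeroing one weight touches three intermediate sums (c+d, the left half, and the total), which is where the O(log n) per-draw cost comes from.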