JavaScript: Optimizing `reduce` for performance

【Question title】: JavaScript: Optimizing `reduce` for performance
【Posted】: 2016-12-20 09:34:01
【Question description】:

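I'm using the .reduce method to iterate through an array of objects in order to return the array index of the object that best fits certain conditions. My array has around 30,000 entries right now, and I'm aiming for well over a million. The trouble is that iterating through the array with .reduce takes forever: we're talking almost 4 seconds at the moment, so imagine what happens with the projected 1 million entries. The array is compact. I'm not connecting to a database or server. Here is my code: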

 var startMatchMaking = function () {
    var loopCounter = 0;
    var length = personArray.length;
    do {
        var manCounter = 0;
        loopCounter++;
        for (var i = length; i--;){
            if (!personArray[i].isSingle && personArray[i].sex === "Male" &&
                personArray[i].isAvailable === true) {
                manCounter++;            
                var num = normalRandomScaled(2.1, 12.44);

                var result = personArray.reduce(function(p,c,k,a){
                    return c.sex !== personArray[i].sex &&
                    !c.isSingle && c.isAvailable === true &&
                    c.age <= (personArray[i].age + num) &&
                    c.age >= (personArray[i].age - num) ? k : p;
                }, 0);

                result = !personArray[result].isSingle && 
                    personArray[result].sex !== personArray[i].sex &&
                    personArray[result].age <= (personArray[i].age + num) &&
                    personArray[result].age >= (personArray[i].age - num) ? result : -1;

                if (result >= 0) {
                    householdArray.push (new Household (personArray[i], personArray[result]));
                    personArray[result].isAvailable = false;
                    personArray[i].isAvailable = false;
                }
            }
        }
        document.write("<br>Mancounter is: " + manCounter +
                " loopCounter is: " + loopCounter + " households: " + householdArray.length);
    }
    while (manCounter > 0 && loopCounter <= 5);
};

startMatchMaking();

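Context: I'm trying to develop a standalone application that runs an agent-based demographic simulation. personArray essentially contains 30,000 persons. The piece of code above belongs to the initial setup of the population. The Person objects have already been created and pushed to the array. Each Person object has firstName, lastName, sex, age and isSingle properties, all of which were assigned random values. At this stage of the project, I need to take the Persons who are destined not to be single and pair each of them into a household with a suitable spouse of the opposite sex and a compatible age.

How can I optimize this to improve performance significantly? I'm open to small changes or to a completely different alternative, as long as it outputs the same result.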

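【Comments】:

If you're calling that code inside another loop, then that's your problem.
Are you using reduce correctly? It looks like k is used in a strange place, and where does i come from?
Whatever you're doing with this much data should be done in a database, not in JS. Especially since your code looks a lot more like a query than a reduction.
OK, there you go. You're iterating through personArray, and on each iteration you iterate again via the .reduce() call. That means the .reduce() callback will be called up to nine hundred million times (30,000 × 30,000). Nothing you do to speed up the .reduce() callback is going to help.
Basically what @Pointy means: the moment you find yourself using brute force, what you want to do is look for a better clustering strategy. A first step is to split your data into sex, single and available categories and search only the right category, i.e. eliminate the overhead of checking the wrong categories personArray.length times. A second step could be to partition those categories by age range. That lets you search only the ranges that intersect the age range you are interested in.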

Tags: javascript arrays performance optimization iteration

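【Solution 1】:

You use reduce, and thereby iterate over all elements, inside a loop that also iterates over the elements, as already mentioned in the comments. This results in quadratic complexity: if you double the number of persons, the running time of the algorithm is multiplied by 4. Processing millions of persons this way is therefore completely infeasible.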
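To make the quadratic growth concrete, here is a small sketch that only counts how often the inner callback would run if every element triggered a full pass over the array. That is a worst case (in the question only the eligible men start an inner pass), and the name `callbackCalls` is made up for illustration.

```javascript
// Worst-case number of inner callback invocations when each of the
// n elements triggers a full pass over all n elements.
function callbackCalls(n) {
  return n * n;
}

console.log(callbackCalls(30000));                        // 900000000
console.log(callbackCalls(60000) / callbackCalls(30000)); // 4
console.log(callbackCalls(1000000));                      // 1000000000000
```

Going from 30,000 to 1,000,000 persons multiplies the work by roughly 1,100, which is why tuning the callback itself cannot rescue the approach.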

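It seems to me that the inner iteration over all elements is unnecessary. You can replace reduce with a plain loop and stop iterating as soon as a match is found. The current solution takes the last match found. Is there anything that makes the last match better than the first, or am I missing something? What about picking some indexes at random while searching and stopping once a match is found? This solution does not require many changes, and I expect it to make a big difference, except for very young and very old persons (outliers).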
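A minimal sketch of the early-exit idea, assuming the same property names as in the question (`sex`, `age`, `isSingle`, `isAvailable`). The helper name `findFirstMatch` is hypothetical, and unlike the original reduce it returns the first match rather than the last.

```javascript
// Returns the index of the first person who matches `man`, or -1.
// `num` is the allowed age difference, as in the question.
function findFirstMatch(people, man, num) {
  for (var k = 0; k < people.length; k++) {
    var c = people[k];
    if (c.sex !== man.sex &&
        !c.isSingle && c.isAvailable === true &&
        c.age <= man.age + num &&
        c.age >= man.age - num) {
      return k; // stop at the first hit instead of scanning the rest
    }
  }
  return -1;
}

var people = [
  { sex: 'Male',   age: 30, isSingle: false, isAvailable: true },
  { sex: 'Female', age: 50, isSingle: false, isAvailable: true },
  { sex: 'Female', age: 31, isSingle: false, isAvailable: true },
  { sex: 'Female', age: 29, isSingle: false, isAvailable: true }
];
console.log(findFirstMatch(people, people[0], 5)); // 2
```

Because the function returns -1 when nothing matches, the second validity check that follows the reduce in the question would no longer be needed.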

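A solution that requires more changes is to map the persons by their properties, similar to what was already mentioned in the comments, so that you can do something like matchCandidates = people[oppositeSex][ageWithSomeRandomness]. Have a look at this post for more information on possible implementations of maps and hash tables in JavaScript.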
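One way such a lookup structure could be sketched with plain objects is shown below. It assumes integer ages, and the names `buildIndex` and `candidates` are invented for this example. Instead of drawing a single random age, this variant collects everyone inside the allowed age window.

```javascript
// Build index[sex][age] -> array of people with that sex and (integer) age.
function buildIndex(people) {
  var index = {};
  people.forEach(function (p) {
    var bySex = index[p.sex] || (index[p.sex] = {});
    (bySex[p.age] || (bySex[p.age] = [])).push(p);
  });
  return index;
}

// Collect people of the given sex whose age lies within
// [age - num, age + num]; only the relevant age slots are visited.
function candidates(index, sex, age, num) {
  var bySex = index[sex] || {};
  var found = [];
  for (var a = Math.ceil(age - num); a <= Math.floor(age + num); a++) {
    if (bySex[a]) found = found.concat(bySex[a]);
  }
  return found;
}

var index = buildIndex([
  { name: 'A', sex: 'Female', age: 28 },
  { name: 'B', sex: 'Female', age: 33 },
  { name: 'C', sex: 'Female', age: 40 },
  { name: 'D', sex: 'Male',   age: 30 }
]);
var names = candidates(index, 'Female', 30, 3.5).map(function (p) {
  return p.name;
});
console.log(names); // ['A', 'B']
```

The cost of a lookup now depends on the width of the age window and the number of people in it, not on the size of the whole population.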

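A further improvement can be achieved by filtering the persons at the beginning so that singles are excluded, i.e. copy the persons who are not single into a new array and access only the new array in the algorithm.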
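As a small illustration of that pre-filtering step (the sample data and the variable names `everyone` and `eligible` are assumptions for this sketch):

```javascript
var everyone = [
  { name: 'A', isSingle: true,  isAvailable: true },
  { name: 'B', isSingle: false, isAvailable: true },
  { name: 'C', isSingle: false, isAvailable: true }
];

// Copy only the people who are not single into a new, smaller array.
var eligible = everyone.filter(function (p) {
  return !p.isSingle;
});

// filter() copies references, not objects, so updates made through
// `eligible` are visible through the original array as well.
eligible[0].isAvailable = false;

console.log(eligible.length);         // 2
console.log(everyone[1].isAvailable); // false
```

Since both arrays share the same objects, the matching code can work on the smaller array without any step to write results back.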

If your code runs in a browser, you can use web workers to keep the browser from freezing.
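A rough sketch of moving work off the main thread with an inline worker. The helper names `countEligibleMen` and `runInWorker` are hypothetical, the counting function is only a stand-in for the real matching step, and `runInWorker` relies on the browser-only `Worker`, `Blob` and `URL.createObjectURL` APIs.

```javascript
// Pure function with no DOM access, so it can run inside a worker.
function countEligibleMen(people) {
  var count = 0;
  for (var i = 0; i < people.length; i++) {
    var p = people[i];
    if (!p.isSingle && p.isAvailable === true && p.sex === 'Male') count++;
  }
  return count;
}

// Browser only: runs countEligibleMen off the main thread and passes the
// result to onDone. The worker script is built from the function's source.
function runInWorker(people, onDone) {
  var source = countEligibleMen.toString() +
    '\nself.onmessage = function (e) {' +
    '  self.postMessage(countEligibleMen(e.data));' +
    '};';
  var url = URL.createObjectURL(
    new Blob([source], { type: 'application/javascript' })
  );
  var worker = new Worker(url);
  worker.onmessage = function (e) {
    worker.terminate();
    URL.revokeObjectURL(url);
    onDone(e.data);
  };
  worker.postMessage(people); // the array is copied (structured clone)
}

// In a browser:
// runInWorker(personArray, function (n) { console.log(n); });
```

Note that the worker receives a copy of the data, so changes such as setting `isAvailable` inside the worker are not visible on the main thread; results would have to be sent back, for example as pairs of array indexes.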

【Discussion】:

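【Solution 2】:

I think you need some preprocessing to speed this up.

For example:

Split the population into men and women
Sort the men by age
Iterate over the sorted array of men and compute a new set of matching women only when a new age is processed
Simply pick women from the current set for as long as it is up to date and not empty

EDIT: Optionally, we can improve the number of matches by sorting the set of women by age difference, so that couples with a small age difference are created first.

Below is some example code.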

var personArray = [];

// create test population
for(var n = 0; n < 30000; n++) {
  personArray.push({
    isSingle: Math.random() < 0.5,
    age: Math.round(18 + Math.random() * 80),
    sex: Math.random() < 0.5 ? 'M' : 'F',
    isAvailable: true
  });
}

var num = 7, // instead of num = normalRandomScaled(2.1, 12.44)
    sex = [ [], [] ],
    curAge = -1, subset,
    houseHold = [],
    ts = performance.now();

// split population into men & women
personArray.forEach(function(p) {
  sex[p.sex == 'M' ? 0 : 1].push(p);
});

// sort men by age
sex[0].sort(function(a, b) { return a.age - b.age; });

// iterate on men
sex[0].forEach(function(m) {
  if(m.age != curAge) {
    // create set of matching women for this age
    // (as in the question, only people who are NOT single get paired)
    subset = sex[1].filter(function(w) {
      return w.isAvailable && !w.isSingle && Math.abs(m.age - w.age) <= num;
    });
    // sort by descending age difference, so that pop() below
    // picks the women with the smallest age difference first
    subset.sort(function(a, b) {
      return Math.abs(m.age - b.age) - Math.abs(m.age - a.age);
    });
    curAge = m.age;
  }
  if(!m.isSingle && subset.length) {
    // pick woman from set
    var w = subset.pop();
    m.isAvailable = false; // (not really necessary)
    w.isAvailable = false;
    houseHold.push([ m, w ]);
  }
});

console.log(
  'Found ' + houseHold.length + ' matches ' +
  'in ' + Math.round(performance.now() - ts) + 'ms'
);
console.log(
  'Random example:',
  houseHold[(Math.random() * houseHold.length) | 0]
);

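【Discussion】:

Interesting, I think I can use something like this. The only thing is that I really want my householdArray to contain Household objects with various properties. Two of those properties would be man and woman, and their values would be the appropriate Person objects from my personArray. The Person objects would also have a household property whose value is the appropriate Household object from householdArray. That way I can do things like personArray[x].household.woman or household[x].man.age. In my head, it's all very neat and orderly.
@neoflash - Sure. I simplified your original code a bit so that it can be executed as a snippet without the Household object being defined. Just use householdArray.push(new Household(m, w)) and you should be good to go.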
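The Household constructor itself is not shown anywhere in the thread. A hypothetical version that supports the two-way navigation described above might look like the following sketch; the sample persons are made up.

```javascript
// Hypothetical Household constructor: stores both partners and links
// each of them back to the household.
function Household(man, woman) {
  this.man = man;
  this.woman = woman;
  man.household = this;
  woman.household = this;
}

var personArray = [
  { firstName: 'Adam', sex: 'Male',   age: 34 },
  { firstName: 'Eve',  sex: 'Female', age: 32 }
];
var householdArray = [];
householdArray.push(new Household(personArray[0], personArray[1]));

console.log(personArray[0].household.woman.firstName); // Eve
console.log(householdArray[0].man.age);                // 34
```

One consequence of the back-references is that the objects become circular, so passing them to JSON.stringify would throw unless the household property is excluded.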

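【Solution 3】:

As others and I have stated in the comments: your approach is brute force, which in this case is quadratic in the input size. There are several possibilities for optimization. For binary values (i.e. booleans), splitting the array into categories is trivial. Numeric values such as age can be clustered, e.g. into ranges. And you should definitely adopt the early abort mentioned by mm759. TL;DR: there is a table and a conclusion at the bottom.

Consider the brute-force approach (for reference):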

// The result is a list of matches [[candidate.id, match.id (or -1)]]
function bruteforce(arr) {
  var matches = [];
  for(var i = 0; i < arr.length; ++i) {
    var candidate = arr[i], num = 12;
    var isCandidate = !candidate.isSingle && candidate.isAvailable && candidate.sex == 1;
    var cSex = candidate.sex;
    var cSexPref = candidate.sexPref;
    var cAgeMin = candidate.age - num;
    var cAgeMax = candidate.age + num;
    var result = !isCandidate ? -1 : arr.reduce(function(p,c,k,a){
      return k != i &&
        c.sex == cSexPref &&
        c.sexPref == cSex &&
        !c.isSingle && c.isAvailable &&
        c.age <= cAgeMax &&
        c.age >= cAgeMin ? k : p;
    }, -1);
    if(isCandidate)
      matches.push([i, result]);
  }
  return matches;
}

The category approach might look like this:

function useTheCategory(arr) {
  // preprocessing the data
  var wmNonsingleAvailables = [];
  var wwNonsingleAvailables = [];
  var mwNonsingleAvailables = [];
  var mmNonsingleAvailables = [];
  // split the data into categories
  arr.forEach(function(c) {
    if(!c.isSingle && c.isAvailable) {
      if(c.sex == 0) {
        if(c.sexPref == 1)
          wmNonsingleAvailables.push(c);
        else
          wwNonsingleAvailables.push(c);
      } else {
        if(c.sexPref == 0)
          mwNonsingleAvailables.push(c);
        else
          mmNonsingleAvailables.push(c);
      }
    }
  });

  var matches = [];
  for(var i = 0; i < arr.length; ++i) {
    var candidate = arr[i], num = 12;
    var isCandidate = !candidate.isSingle && candidate.isAvailable && candidate.sex == 1;
    var cSex = candidate.sex;
    var cSexPref = candidate.sexPref;
    var cAgeMin = candidate.age - num;
    var cAgeMax = candidate.age + num;
    if(isCandidate) {
      var category = null;
      // find the relevant category (in this case)
      // a more complex approach/split might include multiple categories here
      if(cSex == 0) {
        if(cSexPref == 1)
          category = mwNonsingleAvailables;
        else if(cSexPref == 0)
          category = wwNonsingleAvailables;
      } else if(cSex == 1) {
        if(cSexPref == 0)
          category = wmNonsingleAvailables;
        else if(cSexPref == 1)
          category = mmNonsingleAvailables;
      }
      var result = -1;
      if(category == null) {
        // always handle the error case...
        console.log("logic error: missing category!");
        console.log("candidate: " + JSON.stringify(candidate));
      } else {
        // the tests for matching sex/single/availability are left-overs and not necessarily required,
        // they are left in here to show that the reduce is not the culprit of your slowdown
        var match = category.reduce(function(p,c,k,a){
          return c.id != i &&
            c.sex == cSexPref &&
            c.sexPref == cSex &&
            !c.isSingle && c.isAvailable &&
            c.age <= cAgeMax &&
            c.age >= cAgeMin ? k : p;
        }, -1);
        // translate to arr index
        if(match != -1)
          result = category[match].id;
      }
      matches.push([i, result]);
    }
  }
  return matches;
}

The age-range bucket approach might look like this:

function useAgeRange(arr) {
  // preprocessing the data
  var ranges = [1, 2, 3, 4, 5]; // find appropriate ranges to spread the entries evenly (analyse your data, more below...)
  var ageBuckets = [];
  // find the range of age values
  var ageRange = arr.length == 0 ? [0, 0] : arr.reduce(function(p,c) {
    var min = c.age < p[0] ? c.age : p[0];
    var max = c.age > p[1] ? c.age : p[1];
    return [min, max];
  }, [arr[0].age, arr[0].age]);
  // build the buckets (floor for nicer values)
  for(var age = Math.floor(ageRange[0]), maxAge = ageRange[1], step = 0; age <= maxAge; age += step) {
    // update step size
    if(step == 0)
      step = ranges[0];
    else
      step = ranges[Math.min(ranges.length - 1, ranges.indexOf(step) + 1)];
    ageBuckets.push({
      nextAge: age + step,
      bucket: [],
    });
  }
  function findBucketIndex(age) {
    // min i with age < ageBuckets[i].nextAge,
    // or -1 if age lies beyond the last bucket
    for(var i = 0; i < ageBuckets.length; ++i)
      if(age < ageBuckets[i].nextAge)
        return i;
    return -1;
  }
  arr.forEach(function(c) {
    ageBuckets[findBucketIndex(c.age)].bucket.push(c);
  });

  var matches = [];
  for(var i = 0; i < arr.length; ++i) {
    var candidate = arr[i], num = 12;
    var isCandidate = !candidate.isSingle && candidate.isAvailable && candidate.sex == 1;
    var cSex = candidate.sex;
    var cSexPref = candidate.sexPref;
    var cAgeMin = candidate.age - num;
    var cAgeMax = candidate.age + num;
    if(isCandidate) {
      // Find range intersection with ageBuckets
      var startBucket = findBucketIndex(cAgeMin);
      var endBucket = findBucketIndex(cAgeMax);
      if(startBucket < 0) startBucket = 0;
      if(endBucket < 0) endBucket = ageBuckets.length - 1;
      var result = -1;
      // now only search those candidate buckets
      for(var b = startBucket; b <= endBucket; ++b) {
        var bucket = ageBuckets[b].bucket;
        var match = bucket.reduce(function(p,c,k,a){
          return c.id != i &&
            c.sex == cSexPref &&
            c.sexPref == cSex &&
            !c.isSingle && c.isAvailable &&
            c.age <= cAgeMax &&
            c.age >= cAgeMin ? k : p;
        }, -1);
        // translate to arr index
        if(match >= 0)
          result = bucket[match].id;
      }
      matches.push([i, result]);
    }
  }
  return matches;
}

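I created a benchmark on jsfiddle to show the improvement of the two approaches. Both are effective on their own, even with the preprocessing included (the values will vary across systems and browsers):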
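The jsfiddle itself is not reproduced here. A bare-bones timing harness in the same spirit could be sketched as follows; `makePopulation`, `timeIt` and `countCandidates` are names invented for this example, and the generated ages are uniform rather than exponentially distributed as in the benchmark.

```javascript
// Hypothetical test data: ids equal array indexes, which is what the
// functions above assume when they compare c.id with i.
function makePopulation(n) {
  var arr = [];
  for (var i = 0; i < n; i++) {
    arr.push({
      id: i,
      sex: Math.random() < 0.5 ? 0 : 1,
      sexPref: Math.random() < 0.5 ? 0 : 1,
      age: 18 + Math.floor(Math.random() * 60),
      isSingle: Math.random() < 0.5,
      isAvailable: true
    });
  }
  return arr;
}

// Runs fn(arr) once and returns the elapsed wall-clock time in ms.
function timeIt(fn, arr) {
  var start = Date.now();
  fn(arr);
  return Date.now() - start;
}

// Stand-in workload; replace it with bruteforce, useTheCategory
// or useAgeRange to compare the approaches.
function countCandidates(arr) {
  return arr.filter(function (c) {
    return !c.isSingle && c.isAvailable && c.sex == 1;
  }).length;
}

[1000, 2000, 4000].forEach(function (n) {
  var ms = timeIt(countCandidates, makePopulation(n));
  console.log('N=' + n + ': ' + ms + 'ms');
});
```

Doubling N for each run, as the table does, makes it easy to see whether the timings grow roughly linearly or roughly by a factor of four.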

N       Search space  Brute force  Categories  Range buckets
         (#matches)          (relative timing values)
20000   2500          200          34          140
40000   5000          1400         180         556
80000   10000         5335         659         2582
160000  20000         17000        2450        16900

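Analysing your data to find out which approach fits is everything: my benchmark generates an exponential distribution (ages 18-20 make up 28% of the data points, 21-32 another 27%, 33-52 another 27%, and 53-77 the remaining ~18%). As the timings above show, the range approach does not cope well with this distribution (this is for a fixed num = 12 years and 14 buckets), because for most queries the 24-year age window covers 55% of the data.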
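To check how much of a data set a given age window covers before choosing bucket sizes, something along these lines could help. The function names `windowCoverage` and `ageHistogram` and the sample ages are made up for illustration.

```javascript
// Fraction of entries whose age lies in [center - num, center + num].
function windowCoverage(ages, center, num) {
  var hits = ages.filter(function (a) {
    return a >= center - num && a <= center + num;
  }).length;
  return ages.length === 0 ? 0 : hits / ages.length;
}

// Counts per age, to see where the data is concentrated.
function ageHistogram(ages) {
  var hist = {};
  ages.forEach(function (a) {
    hist[a] = (hist[a] || 0) + 1;
  });
  return hist;
}

var ages = [18, 18, 19, 20, 25, 30, 40, 45, 60, 70];
console.log(ageHistogram(ages)[18]);       // 2
console.log(windowCoverage(ages, 20, 12)); // 0.6
```

If a typical window already covers most of the data, as in this skewed sample, range buckets cannot exclude much and the category split is likely to pay off more.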

【Discussion】: