Group similar strings from an array in Node.js: answers

【Question title】: Group similar strings from an array in Node.js
【Posted】: 2017-07-01 23:28:17
【Question】:

Let's say I have a collection of different URLs in an array:

var source = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring']

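What is a good approach to iterate over the array and group similar strings into separate arrays? The desired output for the example above would be: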

var output = [
    ['www.xyz.com/Product/1', 'www.xyz.com/Product/3'],
    ['www.xyz.com/Category/1'],
    ['somestring']
];

Conditions

All items in source can be random strings
The logic must be able to compare and group about 100,000 items in a reasonable amount of time

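I found the string-similarity library, which can compare one string against a collection of strings. One approach would be to iterate over the source, compare each item against the whole source collection, and apply a rule that groups items with similar scores. But I guess that would be terribly inefficient, since it compares every item with every other item.

Can someone suggest an efficient way to accomplish what I need?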

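【Comments on the question】:

So there is a clear pattern in this example, but it looks like you're saying the strings could be anything? Is that right?
@aw04 Yes, there is no fixed pattern; the strings can be anything. As I wrote: all items in source can be random strings.
Good luck :)
Just a note: the similarity-score idea is too simple. You only see how one string relates to another, not how they all relate to each other. The only thing I can think of is a first pass that somehow works out the distinct groups, but that sounds like a very complex algorithm.
You're absolutely right, but I think an algorithm for this kind of comparison must already exist (maybe not implemented in Node.js); I just don't know it. So I hope someone can point me in the right direction :)

Tags: arrays node.js string comparison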

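【Solution 1】:

I modified user7560588's code to use string-similarity, which is based on the Dice coefficient and is mostly better than Levenshtein distance. https://www.npmjs.com/package/string-similarity

You can tune the acceptance rate between 0 and 1, where 1 means a 100% match, so you can pick the threshold that suits your data.

The code loops over the values in the array, compares two strings at a time, and groups them if they match. The library can also compare one string against an array of strings and return the corresponding ratings in an array.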

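const stringSimilarity = require("string-similarity");

// Groups strings whose Dice score against a group's first member exceeds `rate`.
const stringFilter = (source, rate = 0.85) => {
  let _source, matches, x, y;
  _source = source.slice(); // work on a copy so the caller's array is untouched
  matches = [];
  for (x = _source.length - 1; x >= 0; x--) {
    // Take the last remaining string as the seed of a new group
    let output = _source.splice(x, 1);

    for (y = _source.length - 1; y >= 0; y--) {
      const match = stringSimilarity.compareTwoStrings(output[0], _source[y]);
      console.log(output[0], _source[y], match); // debug output; remove for large inputs
      if (match > rate) {
        output.push(_source[y]);
        _source.splice(y, 1);
        x--; // keep x pointing at the last remaining element
      }
    }
    matches.push(output);
  }
  return matches;
};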

let source = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring'];
let output = stringFilter(source);
console.log(output);

Result

somestring www.xyz.com/Category/1 0.06666666666666667
somestring www.xyz.com/Product/3 0.06896551724137931
somestring www.xyz.com/Product/1 0.06896551724137931
www.xyz.com/Category/1 www.xyz.com/Product/3 0.5365853658536586
www.xyz.com/Category/1 www.xyz.com/Product/1 0.5853658536585366
www.xyz.com/Product/3 www.xyz.com/Product/1 0.95
[
  [ 'somestring' ],
  [ 'www.xyz.com/Category/1' ],
  [ 'www.xyz.com/Product/3', 'www.xyz.com/Product/1' ]
]
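
To make the scores above easier to interpret, here is a minimal sketch of a bigram-based Dice coefficient. It is not the string-similarity package's actual source; the names bigramCounts and diceCoefficient are made up for illustration, and details such as whitespace handling are left out.

```javascript
// Count the overlapping two-character substrings (bigrams) of a string.
function bigramCounts(str) {
  const counts = new Map();
  for (let i = 0; i < str.length - 1; i++) {
    const gram = str.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * (shared bigrams) / (bigrams in a + bigrams in b).
function diceCoefficient(a, b) {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0;
  const countsA = bigramCounts(a);
  const countsB = bigramCounts(b);
  let shared = 0;
  for (const [gram, n] of countsA) {
    // A bigram is shared as many times as it occurs in both strings
    shared += Math.min(n, countsB.get(gram) || 0);
  }
  return (2 * shared) / (a.length - 1 + (b.length - 1));
}

// 19 of the 20 bigrams are shared, so the score is 2 * 19 / 40
console.log(diceCoefficient('www.xyz.com/Product/3', 'www.xyz.com/Product/1')); // 0.95
```

Because only the final bigram differs between the two Product URLs, their score lands well above the default rate of 0.85, which is why they end up in the same group.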

【Comments】:

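【Solution 2】:

The best solution I can think of is to compare the strings with each other and test how different they are. There is an algorithm that does exactly this, the Levenshtein distance algorithm:

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

We can easily build a Levenshtein filter on top of the fast-levenshtein module: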

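const levenshtein = require('fast-levenshtein');

// Groups strings whose edit distance to a group's first member is at most `maximum`.
const levenshteinFilter = (source, maximum = 5) => {
  let _source, matches, x, y;
  _source = source.slice(); // work on a copy so the caller's array is untouched
  matches = [];
  for (x = _source.length - 1; x >= 0; x--) {
    // Take the last remaining string as the seed of a new group
    let output = _source.splice(x, 1);
    for (y = _source.length - 1; y >= 0; y--) {
      if (levenshtein.get(output[0], _source[y]) <= maximum) {
        output.push(_source[y]);
        _source.splice(y, 1);
        x--; // keep x pointing at the last remaining element
      }
    }
    matches.push(output);
  }
  return matches;
};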

let source = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring'];
let output = levenshteinFilter(source);
// The loop walks the array from the end, so the groups come out in reverse order:
// [ [ 'somestring' ],
//   [ 'www.xyz.com/Category/1' ],
//   [ 'www.xyz.com/Product/3', 'www.xyz.com/Product/1' ] ]

You can set the maximum acceptable distance with the function's second parameter (default: 5).
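
For readers who want to see what the distance measures, the definition quoted above can be sketched with the classic dynamic-programming recurrence. This is an illustrative stand-alone version, not the fast-levenshtein implementation; the function name levenshteinDistance is an assumption.

```javascript
// Minimum number of single-character insertions, deletions or substitutions
// needed to turn string a into string b (row-by-row dynamic programming).
function levenshteinDistance(a, b) {
  // prev[j] = distance between the first i-1 characters of a and the first j of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i characters into the empty string costs i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep when the characters match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshteinDistance('kitten', 'sitting')); // 3
console.log(levenshteinDistance('www.xyz.com/Product/1', 'www.xyz.com/Product/3')); // 1
```

Each comparison costs time proportional to the product of the two string lengths, and the filter above compares each group seed with every remaining string, so the total work grows roughly quadratically with the number of items in the worst case.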

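【Comments】:

Although I had come up with a library that uses the same algorithm, your solution still works. I haven't measured the performance yet, but thanks for your answer!
Shouldn't it be levenshtein.get(?
@JohnJones Thanks for noticing that.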

【Solution 3】:

Based on your example, I would suggest implementing a radix tree or prefix tree to store the strings. After that, you can define a criterion for clustering them.
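
One way this could look is sketched below. It assumes the strings are split on '/' and that the clustering criterion is "shares the first two segments"; both choices, and the names buildTrie, collect and groupByPrefix, are illustrative assumptions rather than anything the answer specifies.

```javascript
// Build a prefix tree whose edges are '/'-separated segments.
function buildTrie(strings) {
  const root = { children: new Map(), items: [] };
  for (const s of strings) {
    let node = root;
    for (const segment of s.split('/')) {
      if (!node.children.has(segment)) {
        node.children.set(segment, { children: new Map(), items: [] });
      }
      node = node.children.get(segment);
    }
    node.items.push(s); // the full string is stored where its path ends
  }
  return root;
}

// Collect every string stored at or below a node.
function collect(node, out = []) {
  out.push(...node.items);
  for (const child of node.children.values()) collect(child, out);
  return out;
}

// Assumed criterion: strings sharing their first `depth` segments form one group.
function groupByPrefix(strings, depth = 2) {
  const groups = [];
  const walk = (node, level) => {
    if (level === depth || node.children.size === 0) {
      groups.push(collect(node));
      return;
    }
    // Strings that end above the cut-off depth form their own group
    if (node.items.length > 0) groups.push([...node.items]);
    for (const child of node.children.values()) walk(child, level + 1);
  };
  walk(buildTrie(strings), 0);
  return groups;
}

const urls = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring'];
console.log(groupByPrefix(urls));
```

Building the tree takes a single pass over the input, so this scales to large arrays, but it only groups strings that share an exact prefix; strings that are similar in other ways stay apart.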

【Comments】:

【Solution 4】:

How about MinHash (https://en.wikipedia.org/wiki/MinHash)?

It was designed for finding duplicate web pages. So I guess you could url.split('/') and treat each URL as a set of words.
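
A rough sketch of that idea follows, under several assumptions: tokens come from url.split('/'), the seeded hash32 function is a simple illustrative hash rather than a vetted one, and 64 hash functions is an arbitrary choice. None of these names come from an existing library.

```javascript
// Seeded 32-bit string hash (FNV-1a style plus a final mixing step); illustrative only.
function hash32(str, seed) {
  let h = (0x811c9dc5 ^ Math.imul(seed + 1, 0x9e3779b1)) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  return h >>> 0;
}

// Signature: for each seeded hash function, the smallest hash over the URL's tokens.
function minhashSignature(url, numHashes = 64) {
  const tokens = new Set(url.split('/'));
  const signature = [];
  for (let seed = 0; seed < numHashes; seed++) {
    let min = Infinity;
    for (const token of tokens) min = Math.min(min, hash32(token, seed));
    signature.push(min);
  }
  return signature;
}

// The fraction of matching positions estimates the Jaccard similarity of the token sets.
function estimateSimilarity(sigA, sigB) {
  let equal = 0;
  for (let i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) equal++;
  }
  return equal / sigA.length;
}

const sigA = minhashSignature('www.xyz.com/Product/1');
const sigB = minhashSignature('www.xyz.com/Product/3');
console.log(estimateSimilarity(sigA, sigB)); // an estimate of the true Jaccard similarity (2 shared tokens out of 4)
```

The appeal for 100,000 items is that each string is hashed once into a fixed-size signature. To avoid comparing all pairs, signatures are usually bucketed (locality-sensitive hashing) so that only strings landing in the same bucket are compared; that step is not shown here.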

【Comments】:

This looks interesting. I'll dig into it, thanks!

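【Solution 5】:

If source contains only URLs (whatever they are), the function below produces the expected output shown in the question. It groups URLs that share their first two '/'-separated segments.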

function filter (source) {
  var output = []
  // Grouping key: the first two '/'-separated segments, e.g. 'www.xyz.com/Product'
  var keyOf = (value) => value.split('/').slice(0, 2).join('/')
  source.forEach((svalue) => {
    // Strings without a '/' never join an existing group
    var group = svalue.indexOf('/') > -1
      ? output.find((tarr) => keyOf(tarr[0]) === keyOf(svalue))
      : undefined
    if (group) {
      group.push(svalue) // added once, even if the group already has several members
    } else {
      output.push([svalue])
    }
  })
  return output
}

【Comments】:

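【Solution 6】:

You don't flesh out your intent, but if I were faced with the task of finding the nearest neighbours of a selected item in a random haystack, I would probably try to build a hash tree.

Or, and this might be cheating, I would let a library do it for me. lunr.js is basically a pure-JS Lucene-style index; I would push your array into it and query it for similar strings. I have had fairly large datasets in lunr.js before and it performed very well. It can't compare to having an Elasticsearch cluster nearby, but it is still impressive.

If you give more details about what you are trying to do, I can provide more details, maybe even some sample code.

【Comments】: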