One-dimensional string clustering with custom distance - Ruby answers

【Question title】: One dimensional string clustering with custom distance - ruby
【Posted】: 2016-12-20 04:56:09
【Question】:

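I have an array of strings that are product names collected from several stores. I need to cluster this array so that each cluster contains the same product, regardless of which store listed it.

For example: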

data = ["Laptop Asus xd45jkl", 
        "Laptop Acer d3000",
        "Notebooh Hp hxsss", 
        "Laptop Asus xd45jkl intel core i7", 
        "Laptop Acer d3000 intel core i5 4gb RAM"
]
desired_output = [["Laptop Asus xd45jkl", "Laptop Asus xd45jkl intel core i7"],
                  ["Laptop Acer d3000", "Laptop Acer d3000 intel core i5 4gb RAM"],
                  ["Notebooh Hp hxsss"]
                 ]

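I would like to use the JaroWinkler distance from the amatch gem as the distance between product names. Is there a k-means-like algorithm, or any other algorithm, that can produce clusters from this array of strings?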

【Comments】:

Tags: ruby artificial-intelligence k-means string-aggregation

【Solution 1】:

I came up with something like this:

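data = ["Laptop Asus xd45jkl", "Laptop Acer d3000", "Notebooh Hp hxsss", "Laptop Asus xd45jkl intel core i7", "Laptop Acer d3000 intel core i5 4gb RAM" ]
# The default block gives every new brand key its own empty array.
clusters = Hash.new { |hash, brand| hash[brand] = [] }

data.each do |item|
  brand = item.split[1] # the second word is assumed to be the brand
  clusters[brand] << item
end

# Keep only the grouped arrays, dropping the brand keys.
clusters = clusters.values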

I'm not sure whether this is compatible with k-means, or how it performs on large datasets.

Edit: 50,000 items take about 2 seconds.
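
Grouping on the second word merges every model of the same brand into one cluster, and it never uses a string distance. As a rough sketch of what the question asks for, the following single-pass "leader" clustering accepts any pairwise similarity through a block. It is not k-means: k-means needs a mean of the points, which strings with only a pairwise distance do not have. The names `token_overlap` and `cluster_by_similarity` and the 0.8 threshold are illustrative assumptions, and the word-overlap score is only a stand-in so the example runs without extra gems.

```ruby
# Stand-in similarity in 0.0..1.0: the share of the shorter name's
# distinct words that also appear in the other name (hypothetical helper).
def token_overlap(a, b)
  words_a = a.downcase.split.uniq
  words_b = b.downcase.split.uniq
  return 0.0 if words_a.empty? || words_b.empty?

  (words_a & words_b).size.to_f / [words_a.size, words_b.size].min
end

# Single-pass leader clustering: each item joins the first cluster whose
# first element (its "leader") is similar enough, otherwise it starts a
# new cluster. The block must return a similarity for two strings.
def cluster_by_similarity(items, threshold:)
  items.each_with_object([]) do |item, clusters|
    home = clusters.find { |cluster| yield(cluster.first, item) >= threshold }
    if home
      home << item
    else
      clusters << [item]
    end
  end
end

data = ["Laptop Asus xd45jkl", "Laptop Acer d3000", "Notebooh Hp hxsss",
        "Laptop Asus xd45jkl intel core i7",
        "Laptop Acer d3000 intel core i5 4gb RAM"]

clusters = cluster_by_similarity(data, threshold: 0.8) { |a, b| token_overlap(a, b) }
# => [["Laptop Asus xd45jkl", "Laptop Asus xd45jkl intel core i7"],
#     ["Laptop Acer d3000", "Laptop Acer d3000 intel core i5 4gb RAM"],
#     ["Notebooh Hp hxsss"]]

# With the amatch gem installed, the block could presumably return
# Amatch::JaroWinkler.new(a).match(b) instead (based on the gem's README,
# not run here); the threshold would then need retuning.
```

The result depends on the threshold and on input order, and each item is compared against every existing leader, so the cost grows with the number of clusters. The 2-second timing above applies to the brand-based grouping, not to this sketch.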

【Comments】: