Clustering Strings Based on Similar Word Sequences: Answers

【Question title】: Clustering Strings Based on Similar Word Sequences
【Posted】: 2015-04-05 21:27:17
【Question description】:

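I am looking for an efficient way to cluster about 10 million strings based on the occurrence of similar word sequences.

Consider a list of strings such as: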

the fruit hut number one
the ice cre  am shop number one
jim's taco
ice cream shop in the corner
the ice cream shop
the fruit hut
jim's taco outlet number one
jim's t  aco in the corner
the fruit hut in the corner

After the algorithm has run on them, I would like them to be clustered as follows:

the ice cre  am shop number one
ice cream shop in the corner
the ice cream shop

jim's taco
jim's taco outlet number one
jim's t  aco in the corner

the fruit hut
fruit hut number one
the fruit hut in the corner

Clearly, the sequences that distinguish the clusters are:

ice cream shop, jim's taco and fruit hut

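【Comments】:

What programming language?
It doesn't really matter. I am more interested in the clustering algorithm itself.
What have you tried? For example, scikit-learn implements count and tf-idf models.
@IVlad I am very new to this field. I am looking into approaches (e.g. stackoverflow.com/questions/7196053/…).

Tags: algorithm machine-learning nlp cluster-analysis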

【Solution 1】:

Clustering is the wrong tool for you.

To any unsupervised algorithm, the following partition is just as good:

the fruit hut number one
the ice cre am shop number one
jim's taco outlet number one

the ice cream shop
the fruit hut
jim's taco

ice cream shop in the corner
jim's t aco in the corner
the fruit hut in the corner

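Because to a clustering algorithm, "number one" and "in the corner" are shared phrases too. The second cluster is just the leftovers.

Use something supervised instead.

【Discussion】:

【Solution 2】: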

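I think you are looking for near-duplicate detection, with some unknown threshold that lets you cluster not only near duplicates but also documents that are merely similar enough.

One known solution is to use Jaccard similarity to measure how much two documents differ.

Jaccard similarity is basically this: take the set of words from each document, call these sets s1 and s2, and the Jaccard similarity is |s1 [intersection] s2|/|s1 [union] s2|.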
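To make the definition concrete, here is a minimal Python sketch of word-set Jaccard similarity. The function name `jaccard_similarity`, whitespace tokenization, and the convention for two empty strings are my own assumptions, not part of the answer.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the word sets of two strings."""
    s1, s2 = set(a.split()), set(b.split())
    union = s1 | s2
    if not union:
        # Convention chosen for this sketch: two empty strings count as identical.
        return 1.0
    return len(s1 & s2) / len(union)


# The two strings share 4 words out of 6 distinct words overall.
print(jaccard_similarity("the ice cream shop", "ice cream shop in the corner"))
# No words in common.
print(jaccard_similarity("the fruit hut", "jim's taco"))
```

Note that with plain word sets, word order is ignored entirely.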

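Usually, however, when dealing with near duplicates the order of the words matters to some degree. To account for that, when generating the sets s1 and s2 you actually generate sets of k-shingles (or k-grams) instead of sets of single words.
In your example, with k=2, the sets would be as follows (s2, s4 and s5 are the shingle sets of the 2nd, 4th and 5th strings in your list):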

s2 = { the ice, ice cre, cre am, am shop, shop number, number one }
s4 = { ice cream, cream shop, shop in, in the, the corner }
s5 = { the ice, ice cream, cream shop }

s4 [union] s5 = { ice cream, cream shop, shop in, in the, the corner, the ice } 
s4 [intersection] s5 = { ice cream, cream shop }

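In the example above, the Jaccard similarity of s4 and s5 is 2/6.
In your case, plain k-shingling may perform worse than using single words (1-shingling), but you will have to test both approaches.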
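The k=2 example can be reproduced with a short Python sketch. The helper names `shingles` and `jaccard` are hypothetical, and splitting on whitespace is an assumption about tokenization.

```python
def shingles(text, k=2):
    """Return the set of k-word shingles (as tuples) of a string."""
    words = text.split()
    # Strings with fewer than k words produce an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(s1, s2):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0


s4 = shingles("ice cream shop in the corner")
s5 = shingles("the ice cream shop")
print(sorted(s4 & s5))   # the shared 2-shingles
print(jaccard(s4, s5))   # 2 shared shingles out of 6 distinct ones
```

Setting `k=1` gives the single-word variant, so both approaches can be compared on the same data.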

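This procedure can be scaled up nicely to handle huge collections very efficiently, without checking all pairs and without creating huge numbers of sets. More details can be found in these lecture notes (I gave this lecture about two years ago, based on the author's notes).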
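The answer does not name the scaling technique, and the lecture notes are not reproduced here. One widely used approach for this kind of problem is MinHash: each shingle set is compressed into a short signature, and the fraction of matching signature positions estimates the Jaccard similarity. Whether this is what the notes describe is an assumption on my part. The sketch below uses hypothetical helper names and a seeded `hashlib.blake2b` as the hash family.

```python
import hashlib


def shingles(text, k=2):
    """Set of k-word shingles, each joined into a single string."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    return [min(seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig4 = minhash_signature(shingles("ice cream shop in the corner"))
sig5 = minhash_signature(shingles("the ice cream shop"))
print(estimated_jaccard(sig4, sig5))  # an estimate of the exact value 2/6
```

To avoid comparing all pairs, signatures are typically split into bands and hashed into buckets (locality-sensitive hashing), so that only strings sharing a bucket are compared; that step is not shown here.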

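Once you have done this, you basically have a measure d(s1,s2) for the distance between any two sentences, and you can use any known clustering algorithm to cluster them.

Disclaimer: I used my answer from this thread as the basis for this one, after realizing that near-duplicate detection might fit here.

【Discussion】:

This is actually one of the approaches I was exploring, but I was not sure about it and needed validation. I hope I can follow the approach the way you describe it and try to fit my dataset into it.
Hi, I am looking for an algorithm that works on ordered sequence data. The algorithm used here is well suited to unordered categorical data. Could you do me a favor and recommend some methods for ordered sequence data?
@SHUYULYU For ordered sequences, you probably want some variant of Levenshtein distance.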
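As an illustration of plugging such a distance into a clustering step, here is a minimal sketch that links any two strings whose Jaccard distance is at most a threshold and returns the connected groups (single-linkage style). The function names and the threshold value are assumptions for this example, and the all-pairs loop is only suitable for a small sample, not for 10 million strings.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of the word sets of two strings."""
    s1, s2 = set(a.split()), set(b.split())
    union = s1 | s2
    if not union:
        return 0.0
    return 1.0 - len(s1 & s2) / len(union)


def threshold_clusters(strings, max_distance):
    """Group strings that are connected by pairs within max_distance (union-find)."""
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        if jaccard_distance(strings[i], strings[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


sample = [
    "the ice cream shop",
    "ice cream shop in the corner",
    "jim's taco",
    "jim's taco outlet number one",
]
for cluster in threshold_clusters(sample, max_distance=0.7):
    print(cluster)
```

The result is sensitive to the threshold: a looser one lets shared phrases such as "number one" or "in the corner" chain unrelated strings together, which is the concern raised in Solution 1.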
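For reference, a minimal sketch of the standard Levenshtein distance, written to work on any sequences so it can compare lists of words as well as characters. Applying it at the word level is my assumption about what "a variant" could look like here; the comment itself does not specify one.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two sequences."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Character level.
print(levenshtein("kitten", "sitting"))
# Word level: order matters, unlike with set-based Jaccard similarity.
print(levenshtein("the ice cream shop".split(),
                  "ice cream shop in the corner".split()))
```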