【Question title】: Grouping words that are similar
【Posted】: 2016-02-29 03:04:36
【Question description】:

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')

I would like to get:

CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow

But this would be perfectly fine too:

CompanyName2
1
1
1
2
2
3
3

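I have seen algorithms that compute the distance between two words, so if I had just one odd name I would compare it with all the other names and pick the one with the smallest distance. But I have thousands of names, and I want to group all of them.

I know nothing about Elasticsearch, but could a function in the elastic package, or in some other package, help me here?

I'm sorry that there is no programming here. I know. But this is outside my normal area of expertise.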

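【Comments】:

You could google "fuzzy matching". There is no way to do this reliably for arbitrary input: there are many examples of different companies with very similar names.
You could try the adist function (approximate string distance).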
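
As a rough sketch of the adist suggestion, assuming plain Levenshtein-style edit distance is good enough for a first pass: the snippet below uses only base R, computes all pairwise distances for a shortened copy of the question's names, and looks up each name's nearest neighbour. The variable names (company, d, nearest) are illustrative, not from the original post.

```r
# Base R only: adist() returns a matrix of generalized Levenshtein distances.
company <- tolower(c("Kraft", "Kraft Foods", "Kfraft", "nestle", "nestle usa"))

d <- adist(company)
dimnames(d) <- list(company, company)
diag(d) <- NA  # ignore the zero distance of each name to itself

# For every row, take the column with the smallest remaining distance.
nearest <- company[apply(d, 1, which.min)]
data.frame(name = company, nearest = nearest)
```

A nearest neighbour alone does not give groups, and the full matrix grows quadratically with the number of names, so for thousands of names this is only a starting point.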

Tags: r elasticsearch nlp

【Solution 1】:

Solution: use string distances

You are on the right track. Here is some R code to get you started:

# install.packages("stringdist")  # run once if the package is not installed yet
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName <- tolower(CompanyName)  # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
# (see ?"stringdist-metrics" for the others)
sdm <- stringdistmatrix(CompanyName, CompanyName, useNames = "strings", method = "lcs")

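Let's take a look. These are the distances between the strings, computed with the longest common subsequence metric (try others, e.g. cosine or Levenshtein). Essentially, they all measure how many characters the strings have in common. Their pros and cons are beyond the scope of this Q&A. You might want to look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow).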

sdm[1:5,1:5]
            kraft kraft foods kfraft nestle nestle usa
kraft           0           6      1      9         13
kraft foods     6           0      7     15         15
kfraft          1           7      0     10         14
nestle          9          15     10      0          4
nestle usa     13          15     14      4          0
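
To make the point about alternative metrics concrete, here is a minimal sketch that scores one pair of names under several stringdist methods and then derives an overlap score from the LCS distance. The helper lcs_overlap is hypothetical (it is not part of stringdist), and it rewards a shared subsequence rather than a strict substring, which is an assumption about what "contains the same substring" should mean here.

```r
library(stringdist)  # assumes the package is installed

a <- "dow"
b <- "the dow chemical company"

# The same pair under several metrics: "lcs" and "lv" are edit counts,
# "cosine" and "jw" are scaled to the range 0..1.
d <- sapply(c(lcs = "lcs", lv = "lv", cosine = "cosine", jw = "jw"),
            function(m) stringdist(a, b, method = m))
print(d)

# Hypothetical helper: the share of the shorter string that appears, in order,
# inside the longer one. The LCS length is recovered from the LCS distance,
# which equals nchar(a) + nchar(b) - 2 * (LCS length).
lcs_overlap <- function(a, b) {
  lcs_len <- (nchar(a) + nchar(b) - stringdist(a, b, method = "lcs")) / 2
  lcs_len / pmin(nchar(a), nchar(b))
}

lcs_overlap("dow", "the dow chemical company")  # 1: all of "dow" is found
lcs_overlap("kraft", "nestle")                  # 0.2: only one shared character
```

Because the score is subsequence-based, "gm" against "general motors" also scores 1. That helps with abbreviations but can over-match very short names, so treat it as one signal among several.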

Some visualization:

# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
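
The dendrogram can also be turned into the integer group labels the question asks for. The following is a sketch, not a tuned solution: it rebuilds the distances so that it runs on its own, and the cut height h = 8 is an assumption you would pick by inspecting the plot for your own data.

```r
library(stringdist)  # assumes the package is installed

company <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                     'GM', 'general motors', 'the dow chemical company', 'Dow'))

# With a single argument, stringdistmatrix() returns a dist object directly.
sdm_dist <- stringdistmatrix(company, method = "lcs")

hc <- hclust(sdm_dist)       # complete linkage by default
groups <- cutree(hc, h = 8)  # cut height is an assumption; tune it for your data

data.frame(CompanyName = company, group = groups)
```

Cutting by height (h) avoids having to know the number of groups in advance. Use cutree(hc, k = ...) instead if the number of groups is known.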

If you want to partition the names explicitly into k groups, use k-medoids.

library("cluster")
# Partition around 5 medoids and plot the clusters
clusplot(pam(sdm_dist, 5), color=TRUE, shade=FALSE, labels=2, lines=0)
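
Beyond the plot, the pam fit can supply both outputs from the question: an integer group per name, and a representative name per group (the medoid). This is a sketch under the same assumptions as the plot: k = 5 is simply carried over from the call to pam and is not a recommended value, and the names fit, medoid_name and CompanyName2 are illustrative.

```r
library(stringdist)  # assumes the package is installed
library(cluster)

company <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                     'GM', 'general motors', 'the dow chemical company', 'Dow'))
sdm_dist <- stringdistmatrix(company, method = "lcs")

k <- 5  # assumption: copied from the plot call, tune for your data
fit <- pam(sdm_dist, k)

# fit$clustering holds one integer group per name;
# fit$id.med holds the row index of each medoid.
# Look up each medoid's own cluster number rather than assuming
# that id.med is stored in cluster order.
medoid_name <- character(k)
medoid_name[fit$clustering[fit$id.med]] <- company[fit$id.med]

CompanyName2 <- medoid_name[fit$clustering]
data.frame(CompanyName = company, group = fit$clustering, CompanyName2)
```

The medoid is always one of the original names, so a misspelling can end up as the group label. Choosing the most frequent or the shortest member of each group may work better on real data.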

【Discussion】: