【Posted】: 2016-01-21 07:38:12
【Question】:
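I have a value, for example Color, and a list of strings: {Colour, Color, Main Color, Main Colour, Theme, Brand, Subject ..... etc}
I want to get the most similar string, excluding the searched string itself. In this example I expect to get Colour (NOT Color).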
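I am ranking the list with the following rules, applied in priority order:
- Filter out identical values
- Check upper/lower case
- Remove whitespace (trim)
- Use Levenshtein distance
- Token order: Main Color = Color Main
- Check acronyms: HP - Hewlett Packard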
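The case, whitespace, and token-order rules above can be read as "compare canonical keys", which allows the candidate list to be indexed once instead of rescanned per word. The following is only a sketch under that assumption; the class `CanonicalKeyIndex` and its methods are hypothetical names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: indexes candidates by two canonical keys.
class CanonicalKeyIndex {
    private final Map<String, List<String>> byKey = new HashMap<>();

    CanonicalKeyIndex(List<String> candidates) {
        for (String c : candidates) {
            // Prefix each key with its rule so the two key kinds never collide.
            add("compact:" + compactKey(c), c);
            add("tokens:" + tokenKey(c), c);
        }
    }

    private void add(String key, String candidate) {
        byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(candidate);
    }

    // Case and whitespace rules: lower-case and drop all whitespace.
    static String compactKey(String s) {
        return s.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Token-order rule: lower-case, split on whitespace, sort the tokens.
    static String tokenKey(String s) {
        String[] tokens = s.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Arrays.sort(tokens);
        return String.join(" ", tokens);
    }

    // Returns candidates sharing either key with the word, excluding the word itself.
    List<String> lookup(String word) {
        Set<String> hits = new LinkedHashSet<>();
        hits.addAll(byKey.getOrDefault("compact:" + compactKey(word), Collections.emptyList()));
        hits.addAll(byKey.getOrDefault("tokens:" + tokenKey(word), Collections.emptyList()));
        hits.remove(word);
        return new ArrayList<>(hits);
    }
}
```

Building the index costs one pass over the candidates; each lookup is then two hash lookups rather than a scan. Levenshtein and acronym checks are not covered by this sketch and would still need a separate pass.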
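Going through a list of 1,000 relevant candidates takes a long time, and I have many candidates to check.
Is there a more efficient way to do this?
Original code: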
// Uses Apache Commons StringUtils and java.util (ArrayList, Arrays, List).
// SortIgnoreCase is the poster's own comparator (not shown), presumably a case-insensitive Comparator<String>.
public static List<String> findSimilarity(String word, List<String> candidates) {
    List<String> recommendations = new ArrayList<>();
    if (!word.equals("")) {
        for (String candidate : candidates) {
            if (!word.equals(candidate)) { //1. same token , lower/upper cases , ignore white spaces
                if (StringUtils.deleteWhitespace(word).equalsIgnoreCase(StringUtils.deleteWhitespace(candidate))) {
                    recommendations.add(candidate);
                }
                //2. same tokens diff order
                else if (candidate.split(" ").length == word.split(" ").length) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    boolean status = true;
                    SortIgnoreCase icc = new SortIgnoreCase();
                    Arrays.sort(candidatearr, icc);
                    Arrays.sort(wordarr, icc);
                    for (int i = 0; i < candidatearr.length; i++) {
                        if (!(candidatearr[i] == null ? wordarr[i] == null : wordarr[i].equalsIgnoreCase(candidatearr[i])))
                            status = false;
                    }
                    if (status) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        //3. distance between words
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    //check for acronym
                    if ((wordarr.length == 1 && candidatearr.length > 1) || (wordarr.length > 1 && candidatearr.length == 1)) {
                        String acronym = "";
                        if (wordarr.length > candidatearr.length) {
                            for (String tmp : wordarr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }
                            if (acronym.equalsIgnoreCase(candidatearr[0])) {
                                recommendations.add(candidate);
                            }
                        } else {
                            for (String tmp : candidatearr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }
                            if (acronym.equalsIgnoreCase(wordarr[0])) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }
        }
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    // allowed edit distance depends on the length of the shorter string
                    int dist = 0;
                    String check = "";
                    if (word.length() > candidate.length()) {
                        check = candidate;
                    } else {
                        check = word;
                    }
                    if (check.length() <= 3) {
                        dist = 0;
                    } else if (check.length() > 3 && check.length() <= 5) {
                        dist = 1;
                    } else if (check.length() > 5) {
                        dist = 2;
                    }
                    if (StringUtils.getLevenshteinDistance(word, candidate) <= dist) {
                        //if(Levenshtein.distance(word,candidate) <= dist){
                        recommendations.add(candidate);
                    }
                }
            }
        }
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    for (String cand : candidatearr) {
                        for (String wor : wordarr) {
                            if (cand.equals(wor) && cand.length() > 4) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }//for
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
        //4. low priority - starts with
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.startsWith(word) || word.startsWith(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
        //5. low priority - contain word
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.contains(word) || word.contains(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
    }
    return recommendations;
}
Thanks, M.
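The Levenshtein pass in the code above accepts a candidate only within a threshold of 0, 1, or 2 edits, and the edit distance between two strings is never smaller than the difference of their lengths. A cheap length check can therefore reject many candidates before any distance is computed. The sketch below assumes the same thresholds as the posted code; `BoundedLevenshtein` and its method names are hypothetical, and the distance routine is a plain two-row dynamic-programming version rather than the Apache one.

```java
// Hypothetical helper: length prefilter in front of a plain Levenshtein distance.
class BoundedLevenshtein {

    // Same thresholds as the posted code, based on the shorter string's length.
    static int maxDistance(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter <= 3) return 0;
        if (shorter <= 5) return 1;
        return 2;
    }

    // Classic Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    static boolean withinThreshold(String word, String candidate) {
        int max = maxDistance(word, candidate);
        // The distance is at least the length difference, so skip the DP when that already exceeds max.
        if (Math.abs(word.length() - candidate.length()) > max) return false;
        return levenshtein(word, candidate) <= max;
    }
}
```

With a threshold of 0 and the identical string already excluded, no candidate can match, so words of three characters or fewer could skip this pass entirely.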
【Comments】:
-
You could use, for example, the Soundex implementation provided by Apache.
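To illustrate the Soundex idea: Apache Commons Codec ships a `Soundex` class in `org.apache.commons.codec.language`. The sketch below is not that class; it is a small dependency-free version of the standard American Soundex rules, written only to show why phonetic codes group spelling variants such as Color and Colour. The class name `SoundexSketch` is hypothetical.

```java
import java.util.Locale;

// Hypothetical, dependency-free Soundex sketch (not the Apache implementation).
class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that are not coded (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char ch : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (ch < 'A' || ch > 'Z') continue;          // ignore spaces, digits, punctuation
            char code = CODES.charAt(ch - 'A');
            if (out.length() == 0) {                     // keep the first letter as-is
                out.append(ch);
                last = code;
                continue;
            }
            if (ch == 'H' || ch == 'W') continue;        // H and W do not separate equal codes
            if (code != '0' && code != last) out.append(code);
            last = code;                                 // vowels reset the "same code" run
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";                // no letters at all
        while (out.length() < 4) out.append('0');        // pad to letter + three digits
        return out.toString();
    }
}
```

Two strings are treated as similar when their codes are equal. Soundex is tuned for English names and ignores word boundaries here, so it may be too coarse for multi-word values.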
-
@KevinEsche He named Levenshtein distance as his ranking method, which Apache StringUtils implements.
-
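Your case looks like a good fit for Java 8 streams with filters. For computational efficiency you can process the candidates in parallel. Would you like to see an example?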
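A minimal sketch of what the stream-based suggestion might look like, assuming only the case/whitespace rule as the filter; this is not the commenter's actual example, and `ParallelSimilarity`, `compact`, and `findSimilar` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch: filter candidates with a parallel stream.
class ParallelSimilarity {

    // Lower-case and strip all whitespace so "Main Color" and "maincolor" compare equal.
    static String compact(String s) {
        return s.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    static List<String> findSimilar(String word, List<String> candidates) {
        String key = compact(word); // computed once, outside the stream
        return candidates.parallelStream()
                .filter(c -> !c.equals(word))          // drop the searched string itself
                .filter(c -> compact(c).equals(key))   // keep case/whitespace variants
                .collect(Collectors.toList());         // encounter order is preserved for a List source
    }

    public static void main(String[] args) {
        System.out.println(findSimilar("Color", Arrays.asList("color", "Color", "Co lor", "Theme")));
    }
}
```

Further rules can be chained as additional `filter` calls. Whether `parallelStream()` actually helps depends on how expensive each predicate is; for cheap checks on 1,000 items the sequential `stream()` may well be faster.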
-
@nolexa I would be glad to see an example. Thanks!
-
@KevinEsche I was wrong. Levenshtein distance and Soundex difference are two separate things, although both are implemented by Apache.
Tags: java string similarity