Efficient way to find the most similar value: answers

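【Question title】: Efficient way to find the most similar value
【Posted】: 2016-01-21 07:38:12
【Question description】:

I have a value, for example Color, and a list of strings: {Colour, Color, Main Color, Main Colour, Theme, Brand, Subject, ... etc.}

I would like to get the most similar string, excluding the searched string itself. In this example I expect to get Colour (not Color).

I am sorting the list using the following rules, ranked in this order:

Filter out identical values
Check upper and lower case
Remove whitespace (trim)
Use Levenshtein distance
String order: Main Color = Color Main
Check acronyms: HP - Hewlett Packard

Going over a list of 1000 relevant candidates takes a lot of time. Moreover, I have many candidates to check.

Is there a more efficient way?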

Original code:

public static List<String> findSimilarity(String word, List<String> candidates) {
    List<String> recommendations = new ArrayList<>();
    if (!word.equals("")) {
        for (String candidate : candidates) {
            if (!word.equals(candidate)) { //1. same token , lower/upper cases , ignore white spaces
                if (StringUtils.deleteWhitespace(word).equalsIgnoreCase(StringUtils.deleteWhitespace(candidate))) {
                    recommendations.add(candidate);
                }
                //2. same tokens diff order
                else if (candidate.split(" ").length == word.split(" ").length) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    boolean status = true;
                    SortIgnoreCase icc = new SortIgnoreCase();
                    Arrays.sort(candidatearr, icc);
                    Arrays.sort(wordarr, icc);
                    for (int i = 0; i < candidatearr.length; i++) {
                        if (!(candidatearr[i] == null ? wordarr[i] == null : wordarr[i].equalsIgnoreCase(candidatearr[i])))
                            status = false;
                    }

                    if (status) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        //3. distance between words
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    //check for acronym
                    if ((wordarr.length == 1 && candidatearr.length > 1) || (wordarr.length > 1 && candidatearr.length == 1)) {
                        String acronym = "";
                        if (wordarr.length > candidatearr.length) {
                            for (String tmp : wordarr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }

                            if (acronym.equalsIgnoreCase(candidatearr[0])) {
                                recommendations.add(candidate);
                            }
                        } else {
                            for (String tmp : candidatearr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }

                            if (acronym.equalsIgnoreCase(wordarr[0])) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }
        }

        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    int dist = 0;
                    String check = "";
                    if (word.length() > candidate.length()) {
                        check = candidate;
                    } else {
                        check = word;
                    }
                    if (check.length() <= 3) {
                        dist = 0;
                    } else if (check.length() > 3 && check.length() <= 5) {
                        dist = 1;
                    } else if (check.length() > 5) {
                        dist = 2;
                    }

                    if (StringUtils.getLevenshteinDistance(word, candidate) <= dist) {
                        //if(Levenshtein.distance(word,candidate) <= dist){
                        recommendations.add(candidate);
                    }
                }
            }
        }

        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");

                    for (String cand : candidatearr) {
                        for (String wor : wordarr) {
                            if (cand.equals(wor) && cand.length() > 4) {
                                recommendations.add(candidate);

                            }
                        }
                    }
                }
            }//for
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }

        //4. low priority - starts with
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.startsWith(word) || word.startsWith(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }

        //5. low priority - contain word
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.contains(word) || word.contains(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
    }
    return recommendations;
}

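Thanks, M.

【Question discussion】:

You could use, for example, the soundex implementation provided by Apache.
@KevinEsche He named Levenshtein distance for the ranking, which is implemented by Apache StringUtils.
Your case looks well suited to Java 8 streams with filters. For computational efficiency you can process in parallel. Would you like to see an example?
@nolexa I would be glad to see an example. Thanks!
@KevinEsche I was wrong. Levenshtein distance and soundex difference are two separate things, although both are implemented by Apache.

Tags: java string similarity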

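【Solution 1】:

Your problem is one of time complexity. Collections.sort() is an O(n log n) operation, and that is the number of times your compare method gets called. The problem is that Levenshtein is an "expensive" calculation.

You can improve sort performance by calculating the distance exactly once for each item, which makes the Levenshtein calculation an O(n) operation, and then sorting on the stored distances.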
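
To make the idea concrete, here is a minimal sketch of "compute once, then sort on the stored value". The class name `PrecomputedDistanceSort` and its helper methods are illustrative assumptions, not code from the question; the Levenshtein routine is written out by hand only so the sketch has no external dependency, and Apache's `getLevenshteinDistance` could presumably be used in its place.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class PrecomputedDistanceSort {

    // Plain dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Computes each distance exactly once (n distance calls),
    // then sorts on the stored integers, which are cheap to compare.
    static List<String> sortByDistance(String search, List<String> candidates) {
        Map<String, Integer> distance = new HashMap<>();
        for (String c : candidates) {
            distance.computeIfAbsent(c, k -> levenshtein(search, k));
        }
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(distance::get));
        return sorted;
    }
}
```

The sort still performs O(n log n) comparisons, but each comparison is now a map lookup and an integer compare rather than a fresh distance calculation.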

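I ran a test sorting lists of random integers of various sizes, and the actual number of calls to compare() was very close to n log₂ n. So for a list of about 1000 strings, this will be about 10 times faster, because log₂(1000) is about 10.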
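
A measurement of this kind could be reproduced roughly as follows. This is only a sketch under assumptions: the class name `CompareCallCounter`, the fixed seed and the list sizes are made up for illustration and are not the test that was actually run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper class, for illustration only.
class CompareCallCounter {

    // Sorts n random integers and returns how many times the comparator was invoked.
    static long countCompareCalls(int n, long seed) {
        Random random = new Random(seed);
        List<Integer> values = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            values.add(random.nextInt());
        }
        AtomicLong calls = new AtomicLong();
        values.sort((a, b) -> {
            calls.incrementAndGet(); // count every comparison the sort performs
            return Integer.compare(a, b);
        });
        return calls.get();
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 1_000, 10_000}) {
            double nLog2n = n * (Math.log(n) / Math.log(2));
            System.out.printf("n=%d compare calls=%d n*log2(n)=%.0f%n",
                    n, countCompareCalls(n, 42L), nLog2n);
        }
    }
}
```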

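You can improve performance further by not sorting at all, and instead just getting the minimum item using the same comparator.

Another improvement is to avoid the distinct() call (which is relatively expensive) by using a Set (which enforces uniqueness) to hold the candidates.

If you can, populate the candidates with values that are already trimmed and lowercased, so you avoid trimming and lowercasing on every run. Put the input through the same normalization so you can use equals() rather than the slower equalsIgnoreCase().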
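
The three suggestions above (minimum instead of sort, a Set instead of distinct(), and normalizing up front) could be combined along these lines. This is a sketch, not a definitive implementation: `NormalizedClosestMatch` and its method names are assumptions, and the distance function is passed in as a parameter so that any implementation (for example a method reference to Apache's `getLevenshteinDistance`) can be plugged in.

```java
import java.util.AbstractMap;
import java.util.Collection;
import java.util.Comparator;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.ToIntBiFunction;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class NormalizedClosestMatch {

    // Trim and lowercase once, so later comparisons can use plain equals().
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Build the candidate set once; the Set drops duplicates, so no distinct() is needed.
    static Set<String> normalizeAll(Collection<String> raw) {
        return raw.stream()
                .map(NormalizedClosestMatch::normalize)
                .collect(Collectors.toSet());
    }

    // Computes one distance per candidate and takes the minimum in a single pass.
    // Returns an empty Optional when no candidate is left after filtering.
    static Optional<String> closest(String search, Set<String> normalizedCandidates,
                                    ToIntBiFunction<String, String> distance) {
        String key = normalize(search);
        return normalizedCandidates.stream()
                .filter(c -> !c.equals(key))
                .map(c -> new AbstractMap.SimpleImmutableEntry<>(c, distance.applyAsInt(key, c)))
                .min(Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue()))
                .map(e -> e.getKey());
    }
}
```

Returning an Optional leaves the "no match" case to the caller instead of throwing from get().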

Here is one way to do it:

import static org.apache.commons.lang.StringUtils.getLevenshteinDistance;

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

String search; // your input
Set<String> candidates = new HashSet<>(); // populate this with lots of values
// cache so that each candidate's distance is computed at most once
Map<String, Integer> cache = new ConcurrentHashMap<>();
String closest = candidates.parallelStream()
    .map(String::trim)
    .filter(s -> !s.equalsIgnoreCase(search))
    .min((a, b) -> Integer.compare(
      cache.computeIfAbsent(a, k -> getLevenshteinDistance(search, k)),
      cache.computeIfAbsent(b, k -> getLevenshteinDistance(search, k))))
    .get(); // throws NoSuchElementException if no candidate remains

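This code executes in about 50 ms for 1000 random candidates, and in about 1 second for 1 million candidates.

【Discussion】:

I am getting an error: The type HashCode is not generic; it cannot be parameterized with arguments
@userit1985 It should be HashSet. See the edited version. I just thumbed this in on the train, so there may be other typos too. Do you know how to populate the Set with your values?
I still get an error on 'computeIfAbsent'. I generated the Set this way: Set candidates = new HashSet(Arrays.asList("Main Color", "color", "Theme", "Colour", "Main Colour", "Color")); // populate this with lots of values
@userit1985 I have fixed the minor syntax errors. When tested with a million random candidate strings, this code executes in 700 ms, and in 50 ms for 1000 candidates.
Great! But the code does not address conditions 5 and 6: string order (Main Color = Color Main) and checking acronyms (HP - Hewlett Packard). Also, the acronyms are arbitrary; I do not have a closed list.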

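【Solution 2】:

Edited

I have wrapped the answer given by Bohemian into the context of your original code so that you can understand it better.

The line .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) splits multi-word terms, sorts the words, and joins them again to eliminate permutations of the same words. This is the answer to the permutation-equality challenge for terms such as "Main Color" and "Color Main".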
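
In isolation, that canonicalization step might look like the sketch below. The class and method names (`TokenOrderKey`, `canonical`) are assumptions introduced for illustration, and the trim, lowercase and whitespace-regex handling go slightly beyond the one-liner above.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class TokenOrderKey {

    // Lowercases, splits on whitespace, sorts the tokens and joins them back,
    // so that permutations of the same words collapse to one canonical key.
    static String canonical(String term) {
        return Arrays.stream(term.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .sorted()
                .collect(Collectors.joining(" "));
    }
}
```

Two terms are then permutations of each other exactly when their canonical keys are equal, so the check becomes a plain equals() or a hash lookup.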

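However, it makes no sense to capture all the business requirements of your task within the context of this question. With this answer you already have the outline of a solution, and the efficiency problem is solved. You may need more stages in your pipeline, but that is another matter. The advantage of this approach is that all stages are independent, so you can ask questions and seek help on each stage separately.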

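import static org.apache.commons.lang.StringUtils.getLevenshteinDistance;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public static String findSimilarity(String word, List<String> candidatesList) {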

    // Populating the set with distinct values of the input terms
    Set<String> candidates = candidatesList.stream()
            .map(String::toLowerCase)
            .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) // eliminates permutations
            .collect(Collectors.toSet());

    Map<String, Integer> cache = new ConcurrentHashMap<>();

    return candidates.parallelStream()
            .map(String::trim)
                    // add more mappers if needed
            .filter(s -> !s.equalsIgnoreCase(word))
                    // add more filters if needed
            .min((a, b) -> Integer.compare(
                    cache.computeIfAbsent(a, k -> getLevenshteinDistance(word, k)),
                    cache.computeIfAbsent(b, k -> getLevenshteinDistance(word, k))))
            .get(); // get the closest match
}

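【Discussion】:

Hi, thanks. But the acronyms are arbitrary; I do not have a detailed list. I split the string into words using the " " delimiter and take the first character of each. Also, the code does not handle request 5, checking the order of the strings.
Of course the source does not have to be a closed list. I just followed your sample. If you split the input string, you can use Stream.of(input.split(" ")). Could you elaborate on request 5? How exactly do you want to order the strings?
Could you add it to the code? 1. I want to take a given string, for example "Brand Compatible", and check whether "Compatible Brand" exists in the sorted list. 2. I want to take the given string "Brand Compatible" and check whether "BC" (the acronym) exists in the sorted list. Thanks!
First, I do not understand what the criterion for the check is. Is it having the same words in a different permutation? You give only one example. Second, I do not understand what you want to do once the condition has been checked. Suppose you have both "Compatible Brand" and "Brand Compatible"; what do you want to do with them?
Yes. I want to check whether the same tokens exist in a different order in some string, and then recommend that string as a similar one. For example, Main Color would be recommended for Color Main, and Brand Compatible would be recommended for Compatible Brand. Different permutations would be recommended as similar strings.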
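
One possible way to cover both requests from this discussion (same tokens in a different order, and acronyms built from first letters) without scanning all candidates on every query is to index the candidates once. The sketch below is only an assumption about how that could be done: `PermutationAndAcronymIndex` and its methods are hypothetical names, and it follows the rule from the original code that an acronym match needs one single-word side and one multi-word side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class PermutationAndAcronymIndex {
    private final Map<String, List<String>> byTokenKey = new HashMap<>();
    private final Map<String, List<String>> byAcronym = new HashMap<>();

    PermutationAndAcronymIndex(Collection<String> candidates) {
        for (String c : candidates) {
            byTokenKey.computeIfAbsent(tokenKey(c), k -> new ArrayList<>()).add(c);
            if (tokens(c).length > 1) { // only multi-word candidates can be spelled out acronyms
                byAcronym.computeIfAbsent(acronym(c), k -> new ArrayList<>()).add(c);
            }
        }
    }

    static String[] tokens(String s) {
        return s.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    // Sorted, lowercased tokens: equal for any permutation of the same words.
    static String tokenKey(String s) {
        String[] t = tokens(s);
        Arrays.sort(t);
        return String.join(" ", t);
    }

    // First letter of every token, e.g. "Hewlett Packard" -> "hp".
    static String acronym(String s) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens(s)) {
            if (!t.isEmpty()) {
                sb.append(t.charAt(0));
            }
        }
        return sb.toString();
    }

    // Candidates that are a permutation of word, or its acronym / its expansion.
    List<String> similarTo(String word) {
        Set<String> result = new LinkedHashSet<>(
                byTokenKey.getOrDefault(tokenKey(word), Collections.emptyList()));
        if (tokens(word).length > 1) {
            // multi-word input: look for a single-word candidate equal to its acronym
            result.addAll(byTokenKey.getOrDefault(acronym(word), Collections.emptyList()));
        } else {
            // single-word input: look for multi-word candidates whose acronym it is
            result.addAll(byAcronym.getOrDefault(tokenKey(word), Collections.emptyList()));
        }
        result.remove(word); // never recommend the searched string itself
        return new ArrayList<>(result);
    }
}
```

Building the index costs one pass over the candidates; each query is then a few hash lookups, and the Levenshtein stage only needs to run when these exact checks find nothing.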