【Question title】: How do I find the percentage of similarity between two multiline Strings?
【Posted】: 2017-05-17 04:03:32
【Question】:

I have two multiline strings. I am using the following code to determine how similar they are. It uses the Levenshtein distance algorithm.

  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { 
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }

    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

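But the code above does not work as expected.

For example, suppose we have the following two strings, s1 and s2:

S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?

S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?

I then pass these strings to the similarity method, but it does not report the exact percentage difference. How can I optimize the algorithm?

Below is my main method.

Update: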

public static boolean authQuestion(String question) throws SQLException{


        boolean isQuestionAvailable = false;
        Connection dbCon = null;
        try {
            dbCon = MyResource.getConnection();
            String query = "SELECT * FROM WORDBANK where WORD ~*  ?;";
            PreparedStatement checkStmt = dbCon.prepareStatement(query);
            checkStmt.setString(1, question);
            ResultSet rs = checkStmt.executeQuery();
            while (rs.next()) {
                double re=similarity( rs.getString("question"), question);
                if(re  > 0.6){
                    isQuestionAvailable = true;
                }else {
                    isQuestionAvailable = false;
                }
            }
        } catch (URISyntaxException e1) {
            e1.printStackTrace();
        } catch (SQLException sqle) {
            sqle.printStackTrace();
        } catch (Exception e) {
            if (dbCon != null)
                dbCon.close();
        } finally {
            if (dbCon != null)
                dbCon.close();
        }

        return isQuestionAvailable;
    }

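【Comments】:

Take a look at Apache's implementation for ideas.
So, what percentage did you get, what did you expect to get, and why? Also, what do you mean by "optimize the algorithm"? Optimize its performance, or do you mean "fix" it until it gives what you expect?
Fix it until I get what I want. It always prints 100%.
I should point out that your code doesn't print anything. I just tried the two strings that appear in your question and it gave 96.94656488549618%, so not 100%. From that I conclude that the problem is probably in the code you use to print the output, or that you are not running it correctly. Please include your main method.
I'm not sure what you intend to do with the SQL query. If you search using the S1 string, you will not find your S2 in the database. The ~* operator in your query is PostgreSQL's case-insensitive regular expression match operator, but the string you pass in is not a regular expression. So if it finds no match in the database, execution never enters your while loop, and isQuestionAvailable remains false.

Tags: java algorithm levenshtein-distance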

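【Solution 1】:

Because you use the whole of S1 in the WHERE clause of the SQL query, it either finds a perfect match or returns no rows at all.

As @ErwinBolwidt said, if it returns nothing, your isQuestionAvailable will always stay false. If it returns a perfect match, you are bound to get 100% similarity.

What you can do instead: use a substring of S1 to search for questions that match that part.

You can make the following change:

In the authQuestion method: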

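checkStmt.setString(1, question.substring(0, Math.min(20, question.length()))); // e.g. the first 20 characters; Math.min avoids an exception on shorter questions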

You can then compare each of the fetched rows with your question.
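
The change above could be sketched as follows. `QuestionSearch` and its methods are hypothetical helpers, not part of the original code. The sketch guards against questions shorter than the prefix length, because `substring(0, 20)` throws `StringIndexOutOfBoundsException` on shorter input, and it backslash-escapes regex metacharacters, because `~*` treats the bound value as a pattern. The prefix length of 20 is only the example value from this answer, and the set of escaped characters is an assumption.

```java
final class QuestionSearch {
    // Characters treated as regex metacharacters (an assumed set for this sketch).
    private static final String META = "\\.[]{}()*+?^$|";

    /** Returns at most maxLen leading characters of s, without throwing on short input. */
    static String safePrefix(String s, int maxLen) {
        return s.substring(0, Math.min(maxLen, s.length()));
    }

    /** Backslash-escapes regex metacharacters so the text is matched literally. */
    static String escapeRegex(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (char c : s.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds the value to bind to the ? placeholder: cut the prefix first, then escape it. */
    static String searchPattern(String question, int prefixLen) {
        return escapeRegex(safePrefix(question, prefixLen));
    }
}
```

In authQuestion it would be used as `checkStmt.setString(1, QuestionSearch.searchPattern(question, 20));`. The prefix is taken before escaping so that an escape sequence is never cut in half.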

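【Discussion】:

【Solution 2】:

Your similarity method returns a number between 0 and 1 (inclusive), where 1 means the strings are identical (edit distance of zero).

However, in your authQuestion method you treat it as if it returned a number between 0 and 100, as this line shows: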

if(re > 60){

You need to change it to

if(re > .6){

or to

if(re * 100 > 60){
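
To illustrate the scale, here is a minimal sketch with hypothetical names (`ThresholdCheck`, `toPercent`, `anyAbove`); the scores are assumed to come from the similarity method and to lie between 0 and 1. It also returns as soon as one score passes the threshold, whereas the loop in the question reassigns isQuestionAvailable on every row, so only the last row counts. Stopping at the first match is an assumption about the intended behavior.

```java
import java.util.List;

final class ThresholdCheck {
    /** Converts a 0..1 similarity score to a 0..100 percentage. */
    static double toPercent(double score) {
        return score * 100.0;
    }

    /** True if at least one score is strictly above the threshold; both use the 0..1 scale. */
    static boolean anyAbove(List<Double> scores, double threshold) {
        for (double s : scores) {
            if (s > threshold) {
                return true; // a later, lower score cannot undo this match
            }
        }
        return false;
    }
}
```

With this shape, a threshold of 60 on 0..1 scores can never pass, which is the mismatch this answer describes.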

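【Discussion】:

I tried that, but it still doesn't work as expected.
I used re > .6, but the loop always returns false. It returns true if and only if the code does not enter the while loop.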

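【Solution 3】:

I can suggest one approach...

You are using the edit distance, which gives you the number of characters in S1 that you need to change, add, or delete to turn it into S2.

So, for example: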

S1 = "abc"
S2 = "cde"

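The edit distance is 3, and the strings are 100% different (assuming you view this as a kind of character-by-character comparison).

So you can get a rough percentage if you do this: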

S1 = "abc"
S2 = "cde"
edit = edit_distance(S1, S2)
percentage = min(edit/S1.length(), edit/S2.length())

The min is a workaround for cases where the strings are very different, for example:

S1 = "abc"
S2 = "defghijklmno"

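Here the edit distance is greater than the length of S1, so dividing by that length would give more than 100%. Dividing by the larger of the two lengths is therefore probably the better choice.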
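
A minimal Java sketch of that idea, with hypothetical names (`EditPercentage`, `differencePercent`). It uses a plain two-row Levenshtein implementation that is case-sensitive, unlike the editDistance in the question, which lowercases both strings first. Dividing by the longer length keeps the result between 0 and 100, because the edit distance can never exceed the length of the longer string.

```java
final class EditPercentage {
    /** Two-row Levenshtein distance (case-sensitive). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Difference as a percentage of the longer string's length, in the range 0..100. */
    static double differencePercent(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // two empty strings are identical
        }
        return 100.0 * editDistance(a, b) / longer;
    }
}
```

The similarity percentage is then `100.0 - EditPercentage.differencePercent(s1, s2)`.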

Hope this helps.

【Discussion】: