Edit distance recursive algorithm -- Skiena: answers

【Question title】: Edit distance recursive algorithm -- Skiena
【Posted】: 2013-10-13 14:46:21
【Question】:

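I'm reading The Algorithm Design Manual by Steven Skiena, and I'm on the dynamic programming chapter. He has some example code for edit distance that uses functions explained neither in the book nor anywhere on the internet. So I'm wondering:

a) How does this algorithm work?

b) What do the functions indel and match do?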

#define MATCH     0       /* enumerated type symbol for match */
#define INSERT    1       /* enumerated type symbol for insert */
#define DELETE    2       /* enumerated type symbol for delete */

int string_compare(char *s, char *t, int i, int j)
{
        int k;                  /* counter */
        int opt[3];             /* cost of the three options */
        int lowest_cost;        /* lowest cost */

        if (i == 0) return(j * indel(' '));
        if (j == 0) return(i * indel(' '));

        opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
        opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
        opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);

        lowest_cost = opt[MATCH];
        for (k=INSERT; k<=DELETE; k++)
                if (opt[k] < lowest_cost) lowest_cost = opt[k];

        return( lowest_cost );
}

Tags: c++ c algorithm dynamic-programming

【Solution 1】:

They are explained in the book. Please read section 8.2.4, Varieties of Edit Distance.


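【Solution 2】:

Basically, it uses the dynamic programming approach: the solution to the problem is built from the solutions to its subproblems, working either bottom-up or top-down, so that nothing is computed twice.

The recursive structure of the problem is as given here, where i, j are the start (or end) indices in the two strings respectively.

Here is an excerpt from this page that explains the algorithm well.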

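Problem: Given two strings of size m and n, and a set of operations replace (R), insert (I) and delete (D), all at equal cost, find the minimum number of edits (operations) required to convert one string into the other.

Identifying the recursive method: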

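What will the subproblems be in this case? Consider finding the edit distance of parts of the strings, say small prefixes. Let us denote them as [1...i] and [1...j] for some 1 < i < m and 1 < j < n. This is a smaller instance of the final problem; denote it E(i, j). The goal is to find E(m, n).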

Within the prefixes, we can right-align the strings in three ways: (i, -), (-, j) and (i, j). The hyphen (-) represents no character. An example makes this clearer.

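Take the strings SUNDAY and SATURDAY. We want to convert SUNDAY into SATURDAY with the minimum number of edits. Let us pick i = 2 and j = 4, i.e. the prefix strings are SU and SATU respectively (assuming string indices start at 1). The rightmost characters can be aligned in three different ways.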

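Case 1: Align the characters U and U. They are equal, so no edit is required. We are left with the problem of i = 1 and j = 3, E(i-1, j-1).

Case 2: Align the rightmost character of the first string with no character from the second string. We need a deletion (D) here. We are left with the problem of i = 1 and j = 4, E(i-1, j).

Case 3: Align the rightmost character of the second string with no character from the first string. We need an insertion (I) here. We are left with the problem of i = 2 and j = 3, E(i, j-1).

Combining all the subproblems, the minimum cost of aligning the prefix strings ending at i and j is given by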

E(i, j) = min( [E(i-1, j) + D], [E(i, j-1) + I], [E(i-1, j-1) + R if the i-th and j-th characters differ] )

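We are not done yet. What are the base cases?

When both strings have size 0, the cost is 0. When only one of the strings is empty, we need as many edit operations as the length of the non-empty string. Mathematically,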

E(0, 0) = 0, E(i, 0) = i, E(0, j) = j
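
The recurrence and base cases above can be written down almost literally. The following is only a minimal sketch, assuming unit costs D = I = R = 1; the names E, min3 and edit_distance are chosen here for illustration and do not come from the excerpt. It uses 0-indexed C strings, so the i-th character of a prefix is s[i - 1].

```c
#include <string.h>

/* Smallest of three ints. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * E(i, j): edit distance between the first i characters of s and the
 * first j characters of t, assuming unit costs D = I = R = 1.
 * Prefix lengths are 1-based as in the text, so the i-th character
 * of s is s[i - 1] in a 0-indexed C string.
 */
static int E(const char *s, const char *t, int i, int j)
{
    if (i == 0) return j;   /* E(0, j) = j: insert j characters */
    if (j == 0) return i;   /* E(i, 0) = i: delete i characters */

    int r = (s[i - 1] == t[j - 1]) ? 0 : 1;   /* R is charged only on a mismatch */

    return min3(E(s, t, i - 1, j) + 1,        /* case 2: delete from s  */
                E(s, t, i, j - 1) + 1,        /* case 3: insert into s  */
                E(s, t, i - 1, j - 1) + r);   /* case 1: align the two  */
}

/* Hypothetical convenience wrapper: E(m, n) for the whole strings. */
int edit_distance(const char *s, const char *t)
{
    return E(s, t, (int)strlen(s), (int)strlen(t));
}
```

With this sketch, E("SUNDAY", "SATURDAY", 2, 4) evaluates to 2 (insert A and T into SU to get SATU), and edit_distance("SUNDAY", "SATURDAY") evaluates to 3. Because nothing is cached, the running time grows exponentially with the string lengths, so it is only suitable for short inputs.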

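I would recommend going through this lecture for a good explanation.

The match() function returns 1 if the two characters do not match (so that one more move is added to the final answer), and 0 otherwise.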


【Solution 3】:

On page 287 of the book:

int match(char c, char d)
{
  if (c == d) return(0); 
  else return(1); 
}

int indel(char c)
{
  return(1);
}
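
One detail that is easy to miss when trying these functions out: string_compare reads s[i] and t[j] with i and j counting from 1, so it seems to expect each string to carry one padding character at index 0. As far as I recall, the book's driver does this by storing a blank in s[0] and t[0]; treat that as an assumption. The sketch below puts the pieces together. The wrapper compare_padded and the MAXLEN cap are hypothetical names introduced here, not part of the book's code.

```c
#include <string.h>

#define MATCH   0       /* enumerated type symbol for match */
#define INSERT  1       /* enumerated type symbol for insert */
#define DELETE  2       /* enumerated type symbol for delete */
#define MAXLEN  32      /* assumed cap on input length for this sketch */

int match(char c, char d) { return (c == d) ? 0 : 1; }
int indel(char c) { (void)c; return 1; }

/* Same recursion as in the question; s and t are 1-indexed. */
int string_compare(char *s, char *t, int i, int j)
{
    int k;
    int opt[3];
    int lowest_cost;

    if (i == 0) return j * indel(' ');
    if (j == 0) return i * indel(' ');

    opt[MATCH]  = string_compare(s, t, i - 1, j - 1) + match(s[i], t[j]);
    opt[INSERT] = string_compare(s, t, i, j - 1) + indel(t[j]);
    opt[DELETE] = string_compare(s, t, i - 1, j) + indel(s[i]);

    lowest_cost = opt[MATCH];
    for (k = INSERT; k <= DELETE; k++)
        if (opt[k] < lowest_cost) lowest_cost = opt[k];

    return lowest_cost;
}

/*
 * Hypothetical wrapper: copies a and b behind a one-character pad so that
 * the first real character sits at index 1, then starts the recursion at
 * the last characters. Returns -1 if an input exceeds MAXLEN.
 */
int compare_padded(const char *a, const char *b)
{
    char s[MAXLEN + 2], t[MAXLEN + 2];

    if (strlen(a) > MAXLEN || strlen(b) > MAXLEN) return -1;

    s[0] = t[0] = ' ';          /* padding, never compared */
    strcpy(s + 1, a);
    strcpy(t + 1, b);

    return string_compare(s, t, (int)strlen(a), (int)strlen(b));
}
```

For example, compare_padded("SUNDAY", "SATURDAY") returns 3. Calling string_compare directly on unpadded strings with i = strlen(s) would instead compare the terminating '\0' characters first and never look at s[0] or t[0].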

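【Comments】:

If the indel function always returns 1, what is the point of it? We might as well use 1 instead.
@JanacMeena, what's the point? Readability. A literal "1" is just a number, and different literal 1s can carry different meanings; but "indel(...)" is clearly the cost of an insertion/deletion (which happens to be one, but could later be replaced by anything else).
Fair enough. Arguably, the fact that this question has 9000+ views suggests that the indel() function reduced readability, but that is mainly because the method was used in the example before it was defined in the textbook.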
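
To illustrate the point about replacing the cost later, here is a small sketch with non-unit weights. The functions indel_cost, match_cost, weighted and weighted_distance are hypothetical names, and the weights 2 and 3 are chosen purely for illustration; none of this is from the book. Only the two cost functions differ from the unit-cost version, the recursion itself is unchanged.

```c
#include <string.h>

/* Hypothetical cost hooks playing the roles of indel() and match(). */
static int indel_cost(char c) { (void)c; return 2; }              /* insert or delete */
static int match_cost(char c, char d) { return c == d ? 0 : 3; }  /* substitution     */

/* Same recursion shape as string_compare, on 0-indexed strings. */
static int weighted(const char *s, const char *t, int i, int j)
{
    if (i == 0) return j * indel_cost(' ');
    if (j == 0) return i * indel_cost(' ');

    int best = weighted(s, t, i - 1, j - 1) + match_cost(s[i - 1], t[j - 1]);
    int ins  = weighted(s, t, i, j - 1) + indel_cost(t[j - 1]);
    int del  = weighted(s, t, i - 1, j) + indel_cost(s[i - 1]);

    if (ins < best) best = ins;
    if (del < best) best = del;
    return best;
}

/* Hypothetical wrapper over the whole strings. */
int weighted_distance(const char *s, const char *t)
{
    return weighted(s, t, (int)strlen(s), (int)strlen(t));
}
```

Under these weights, weighted_distance("abc", "bc") is 2 (one deletion) and weighted_distance("abc", "abd") is 3 (one substitution, which is cheaper than a deletion plus an insertion at cost 4). With unit costs both would be 1.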

【Solution 4】:

Please go through this link: https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/styles/pages/editdistance.html

The code implementing the above algorithm is:

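/* Smaller of two ints (plain C has no built-in min). */
static int min(int a, int b) { return a < b ? a : b; }

int dpEdit(char *s1, char *s2, int len1, int len2)
{
    if (len1 == 0)              /* base case: insert all of s2 */
        return len2;
    if (len2 == 0)              /* base case: delete all of s1 */
        return len1;

    int add, remove, replace;
    /* table[i][j]: distance between the first i chars of s1
       and the first j chars of s2 */
    int table[len1 + 1][len2 + 1];

    for (int i = 0; i <= len2; i++)
        table[0][i] = i;
    for (int i = 0; i <= len1; i++)
        table[i][0] = i;

    for (int i = 1; i <= len1; i++) {
        for (int j = 1; j <= len2; j++) {
            add = table[i][j - 1] + 1;
            remove = table[i - 1][j] + 1;
            if (s1[i - 1] != s2[j - 1])
                replace = table[i - 1][j - 1] + 1;
            else
                replace = table[i - 1][j - 1];
            table[i][j] = min(min(add, remove), replace);
        }
    }

    return table[len1][len2];   /* distance between the full strings */
}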


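【Solution 5】:

This is a recursive algorithm, not dynamic programming. Note that when the algorithm starts, i & j point to the last characters of s & t respectively.

indel returns 1. match(a, b) returns 0 if a = b (match), otherwise it returns 1 (substitution).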

#define MATCH     0       /* enumerated type symbol for match */
#define INSERT    1       /* enumerated type symbol for insert */
#define DELETE    2       /* enumerated type symbol for delete */

int string_compare(char *s, char *t, int i, int j)
{
    int k;                  /* counter */
    int opt[3];             /* cost of the three options */
    int lowest_cost;        /* lowest cost */

    // base case, if i is 0, then we reached start of s and 
    // now it's empty, so there would be j * 1 edit distance between s & t
    // think of it if s is initially empty and t is not, how many
    // edits we need to perform on s to be similar to t? answer is where
    // we are at t right now which is j
    if (i == 0) return(j * indel(' '));
    // same reasoning as above but for s instead of t
    if (j == 0) return(i * indel(' '));

    // calculate opt[match] by checking if s[i] = t[j] which = 0 if true or 1 if not
    // then recursively do the same for s[i-1] & t[j-1]
    opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
    // calculate opt[insert] which is how many chars we need to insert 
    // in s to make it looks like t, or look at it from the other way,
    // how many chars we need to delete from t to make it similar to s?
    // since we're deleting from t, we decrease j by 1 and leave i (pointer
    // in s) as is + indel(t[j]) which we deleted (always returns 1)
    opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
    // same reasoning as before but deleting from s or inserting into t
    opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);

    // these lines are just to pick the min of opt[match], opt[insert], and
    // opt[delete]
    lowest_cost = opt[MATCH];
    for (k=INSERT; k<=DELETE; k++)
            if (opt[k] < lowest_cost) lowest_cost = opt[k];

    return( lowest_cost );
}

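The algorithm is not hard to understand; you just need to read it a couple of times. What always amuses me is the person who invented it, and the trust that recursion will do the right thing.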
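
Since the plain recursion recomputes the same (i, j) pairs many times, one way to turn it into top-down dynamic programming is to cache each result the first time it is computed. The following is only a sketch of that idea, not code from the book: memo, solve, memo_edit_distance and the MAXLEN bound are hypothetical names, unit costs are assumed, and the strings are 0-indexed.

```c
#include <string.h>

#define MAXLEN 64   /* assumed upper bound on string length for this sketch */

/* memo[i][j] caches the distance between the first i chars of s and the
   first j chars of t; -1 means "not computed yet". */
static int memo[MAXLEN + 1][MAXLEN + 1];

static int solve(const char *s, const char *t, int i, int j)
{
    if (i == 0) return j;                    /* only insertions remain */
    if (j == 0) return i;                    /* only deletions remain  */
    if (memo[i][j] >= 0) return memo[i][j];  /* already solved         */

    /* match or substitute: costs 0 if the characters agree, else 1 */
    int best = solve(s, t, i - 1, j - 1) + (s[i - 1] != t[j - 1]);
    int ins  = solve(s, t, i, j - 1) + 1;
    int del  = solve(s, t, i - 1, j) + 1;

    if (ins < best) best = ins;
    if (del < best) best = del;

    memo[i][j] = best;
    return best;
}

/* Returns the edit distance, or -1 if an input is longer than MAXLEN. */
int memo_edit_distance(const char *s, const char *t)
{
    size_t m = strlen(s), n = strlen(t);

    if (m > MAXLEN || n > MAXLEN) return -1;

    memset(memo, -1, sizeof memo);   /* sets every int in the table to -1 */
    return solve(s, t, (int)m, (int)n);
}
```

Each (i, j) pair is now solved at most once, so the work is proportional to the product of the two lengths instead of growing exponentially. For instance, memo_edit_distance("intention", "execution") returns 5. The global table keeps the sketch short but makes the function non-reentrant.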


【Solution 6】:

This is probably no longer an issue for the OP, but I'll write down my understanding of the text.

/**
 * Returns the cost of a substitution(match) operation
 */
int match(char c, char d)
{
  if (c == d) return 0;
  else return 1;
}

/**
 * Returns the cost of an insert/delete operation(assumed to be a constant operation)
 */
int indel(char c)
{
  return 1;
}

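Edit distance is essentially the minimum number of modifications needed to transform a given string into another reference string. As you know, the following modifications are possible:

Substitution (replacing a single character)
Insertion (inserting a single character into the string)
Deletion (deleting a single character from the string)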

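Now, quoting the book:

Properly posing the question of string similarity requires us to set the cost of each of these string transform operations. Assigning each operation an equal cost of 1 defines the edit distance between two strings.

So that establishes that each of the three modifications known to us has a constant cost of 1.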

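But how do we know where to modify?

We don't know in advance. Instead, we go character by character from the end of the strings and consider every modification that may or may not be needed. So,

We count the cost when the last step is a substitution (or match), working back from the end of the strings
We count the cost when the last step is a deletion, working back from the end of the strings
We count the cost when the last step is an insertion, working back from the end of the strings

Finally, once we have these three totals, we return the minimum of them.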
