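Edit distance with accents: answers

【Question title】: Edit Distance with accents
【Posted】: 2023-07-28 21:36:02
【Question description】:

Is there an edit distance in Python that takes accents into account? For example, one for which the following property holds: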

d('ab', 'ac') > d('àb', 'ab') > 0

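【Comments】:

Wouldn't it work to replace the accented letters with unaccented ones in both strings and then compute the distance?
I agree with that. Unidecode might help: pypi.python.org/pypi/Unidecode/0.04.1
OK, thanks, but then I get d('àa','aa') = 0.
Where exactly is the problem? Do you not know how to tell whether a given character is an "accented" version of another, or how to integrate that fact into the distance itself? (Or both?)
@vigte, what value do you want, then?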
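
To make the objection in the comments concrete, here is a minimal sketch of why stripping accents first cannot satisfy the requested property. The helper name strip_accents is hypothetical, and it uses the standard unicodedata module rather than Unidecode, which is an assumption made to keep the example dependency-free.

```python
import unicodedata

def strip_accents(text):
    # Decompose each accented character into base character + combining marks,
    # then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# After stripping, both strings are identical, so any edit distance
# computed on the stripped forms is 0 and the accent is lost entirely.
print(strip_accents('àa') == strip_accents('aa'))  # True
```

Since the two stripped strings are equal, d('àa','aa') comes out as 0, which violates the "> 0" half of the property in the question.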

Tags: python edit-distance

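【Solution 1】:

Using the Levenshtein module, average the distance between the raw strings with the distance between their accent-stripped forms, so that an accent-only difference counts half as much as a full substitution: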

In [1]: import unicodedata, string

In [2]: from Levenshtein import distance

In [3]: def remove_accents(data):
   ...:     return ''.join(x for x in unicodedata.normalize('NFKD', data)
   ...:                             if x in string.ascii_letters).lower()

In [4]: def norm_dist(s1, s2):
   ...:     norm1, norm2 = remove_accents(s1), remove_accents(s2)
   ...:     d1, d2 = distance(s1, s2), distance(norm1, norm2)
   ...:     return (d1+d2)/2.

In [5]: norm_dist(u'ab', u'ac')
Out[5]: 1.0

In [6]: norm_dist(u'àb', u'ab')
Out[6]: 0.5


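【Solution 2】:

Unicode allows accented characters to be decomposed into a base character followed by a combining accent character; for example, à decomposes into a followed by a combining grave accent.

You want to convert both strings using normalization form NFKD, which decomposes accented characters and converts compatibility characters to their canonical forms, and then use an edit distance metric that ranks substitutions above insertions and deletions. That way a missing accent costs one insertion or deletion of a combining character, which is less than substituting one letter for another.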
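
The approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name weighted_edit_distance and the cost values (1.0 for an insertion or deletion, 1.5 for a substitution) are hypothetical choices, not something the answer specifies. The substitution cost just has to be greater than the insertion/deletion cost.

```python
import unicodedata

def weighted_edit_distance(s1, s2, indel_cost=1.0, sub_cost=1.5):
    # Decompose accented characters so that 'à' becomes 'a' + U+0300.
    a = unicodedata.normalize('NFKD', s1)
    b = unicodedata.normalize('NFKD', s2)

    # Standard Levenshtein dynamic programming, one row at a time,
    # with separate costs for insert/delete and substitute.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            match_cost = 0.0 if ca == cb else sub_cost
            curr.append(min(prev[j] + indel_cost,       # delete ca
                            curr[j - 1] + indel_cost,   # insert cb
                            prev[j - 1] + match_cost))  # keep or substitute
        prev = curr
    return prev[-1]

print(weighted_edit_distance('ab', 'ac'))  # 1.5
print(weighted_edit_distance('àb', 'ab'))  # 1.0
```

With these assumed costs, d('ab', 'ac') > d('àb', 'ab') > 0 holds: changing a letter is one substitution, while dropping an accent is a single deletion of the combining character.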


【Solution 3】:

Here is a solution based on difflib and unicodedata, with no external dependencies:

import unicodedata
from difflib import Differ

# function taken from https://*.com/a/517974/1222951
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore').decode()
    return only_ascii

def compare(wrong, right):
    # normalize both strings to make sure equivalent (but
    # different) unicode characters are canonicalized 
    wrong = unicodedata.normalize('NFKC', wrong)
    right = unicodedata.normalize('NFKC', right)

    num_diffs = 0
    index = 0
    differences = list(Differ().compare(wrong, right))
    while True:
        try:
            diff = differences[index]
        except IndexError:
            break

        # diff is a string like "+ a" (meaning the character "a" was inserted)
        # extract the operation and the character
        op = diff[0]
        char = diff[-1]

        # if the character isn't equal in both
        # strings, increase the difference counter
        if op != ' ':
            num_diffs += 1

        # if a character is wrong, there will be two operations: one
        # "+" and one "-" operation
        # we want to count this as a single mistake, not as two mistakes
        if op in '+-':
            try:
                next_diff = differences[index+1]
            except IndexError:
                pass
            else:
                next_op = next_diff[0]
                if next_op in '+-' and next_op != op:
                    # skip the next operation, we don't want to count
                    # it as another mistake
                    index += 1

                    # we know that the character is wrong, but
                    # how wrong is it?
                    # if the only difference is the accent, it's
                    # a minor mistake
                    next_char = next_diff[-1]
                    if remove_accents(char) == remove_accents(next_char):
                        num_diffs -= 0.5

        index += 1

    # output the difference as a ratio of
    # (# of wrong characters) / (length of longest input string)
    return num_diffs / max(len(wrong), len(right))

Test:

for w, r in (('ab','ac'),
            ('àb','ab'),
            ('être','etre'),
            ('très','trés'),
            ):
    print('"{}" and "{}": {}% difference'.format(w, r, compare(w, r)*100))

"ab" and "ac": 50.0% difference
"àb" and "ab": 25.0% difference
"être" and "etre": 12.5% difference
"très" and "trés": 12.5% difference
