【Question title】: Edit Distance with accents
【Posted】: 2023-07-28 21:36:02
【Question】:

Is there an edit distance in Python that takes accents into account? For example, one for which the following property holds:

d('ab', 'ac') > d('àb', 'ab') > 0

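【Comments】:

  • Wouldn't it work to replace the accented letters with unaccented ones in both strings and then compute the distance?
  • I agree. Using Unidecode might help: pypi.python.org/pypi/Unidecode/0.04.1
  • OK, thanks, but with that approach I get d('àa','aa') = 0.
  • Where exactly is the problem? Do you not know how to tell whether a given character is an "accented" version of another one, or how to incorporate that fact into the distance itself? (Or both?)
  • @vigte, what value do you want, then?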

Tags: python edit-distance


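【Answer 1】:

Using the Levenshtein module, average the distance between the raw strings and the distance between the accent-stripped strings: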

In [1]: import unicodedata, string

In [2]: from Levenshtein import distance

In [3]: def remove_accents(data):
   ...:     return ''.join(x for x in unicodedata.normalize('NFKD', data)
   ...:                             if x in string.ascii_letters).lower()

In [4]: def norm_dist(s1, s2):
   ...:     norm1, norm2 = remove_accents(s1), remove_accents(s2)
   ...:     d1, d2 = distance(s1, s2), distance(norm1, norm2)
   ...:     return (d1+d2)/2.

In [5]: norm_dist(u'ab', u'ac')
Out[5]: 1.0

In [6]: norm_dist(u'àb', u'ab')
Out[6]: 0.5
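
For anyone without the third-party Levenshtein package, here is a minimal standard-library sketch of the same averaging idea for Python 3. The `levenshtein` helper is a hypothetical stand-in for `Levenshtein.distance` (a plain unit-cost implementation), not the package's own code; `remove_accents` and `norm_dist` follow the session above.

```python
import string
import unicodedata


def levenshtein(a, b):
    """Unit-cost edit distance (hypothetical stand-in for Levenshtein.distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def remove_accents(data):
    # NFKD splits "à" into "a" + combining accent; keep only ASCII letters.
    decomposed = unicodedata.normalize("NFKD", data)
    return "".join(ch for ch in decomposed if ch in string.ascii_letters).lower()


def norm_dist(s1, s2):
    # Average of the raw distance and the accent-insensitive distance.
    d_raw = levenshtein(s1, s2)
    d_plain = levenshtein(remove_accents(s1), remove_accents(s2))
    return (d_raw + d_plain) / 2


print(norm_dist("ab", "ac"))        # 1.0
print(norm_dist("\u00e0b", "ab"))   # 0.5  ("\u00e0" is "à")
```

Note that `remove_accents` also drops digits, spaces and punctuation and lowercases the result, so those differences only count through the raw distance.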

【Comments】:

    【Answer 2】:

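    Unicode allows accented characters to be decomposed into a base character followed by a combining accent character; for example, à decomposes into a followed by a combining grave accent.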
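
To see the decomposition concretely, here is a small sketch using only the standard `unicodedata` module. The helper names `decompose` and `is_combining_mark` are made up for illustration.

```python
import unicodedata


def decompose(text):
    """Return the NFKD form of text as (character, Unicode name) pairs."""
    return [(ch, unicodedata.name(ch, "UNKNOWN"))
            for ch in unicodedata.normalize("NFKD", text)]


def is_combining_mark(ch):
    # Combining marks have a non-zero canonical combining class.
    return unicodedata.combining(ch) != 0


parts = decompose("\u00e0")  # "\u00e0" is the precomposed character "à"
print(parts)
# [('a', 'LATIN SMALL LETTER A'), ('̀', 'COMBINING GRAVE ACCENT')]
print([is_combining_mark(ch) for ch, _ in parts])
# [False, True]
```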

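    You want to convert both strings using normalization form NFKD, which decomposes accented characters and converts compatibility characters to their canonical forms, and then use an edit distance metric that ranks substitutions above insertions and deletions. After decomposition a missing accent shows up as a single inserted or deleted combining character, so it costs less than replacing one letter with another.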
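
A minimal sketch of this approach, assuming a substitution cost of 1.5 and an insertion/deletion cost of 1.0. Those weights, and the function name `accent_aware_distance`, are illustrative choices rather than part of the answer; any substitution cost strictly between one and two times the insertion/deletion cost should give the ordering asked for in the question.

```python
import unicodedata


def accent_aware_distance(s1, s2, sub_cost=1.5, indel_cost=1.0):
    """Weighted edit distance over NFKD-decomposed strings.

    The default costs are assumptions chosen for illustration.
    """
    a = unicodedata.normalize("NFKD", s1)
    b = unicodedata.normalize("NFKD", s2)

    # prev[j] is the cost of turning the processed prefix of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                          # delete ca
                curr[j - 1] + indel_cost,                      # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


print(accent_aware_distance("ab", "ac"))        # 1.5 (one substitution)
print(accent_aware_distance("\u00e0b", "ab"))   # 1.0 (one deleted combining accent)
```

With these weights d('ab', 'ac') > d('àb', 'ab') > 0 holds. One caveat: a wrong accent (è vs é) becomes a substitution between two combining characters and therefore costs as much as a wrong letter; if that matters, the substitution cost could be lowered when both characters are combining marks.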

    【Comments】:

      【Answer 3】:

      Here is a solution based on difflib and unicodedata, with no third-party dependencies:

      import unicodedata
      from difflib import Differ
      
      # function taken from https://*.com/a/517974/1222951
      def remove_accents(input_str):
          nfkd_form = unicodedata.normalize('NFKD', input_str)
          only_ascii = nfkd_form.encode('ASCII', 'ignore').decode()
          return only_ascii
      
      def compare(wrong, right):
          # normalize both strings to make sure equivalent (but
          # different) unicode characters are canonicalized 
          wrong = unicodedata.normalize('NFKC', wrong)
          right = unicodedata.normalize('NFKC', right)
      
          num_diffs = 0
          index = 0
          differences = list(Differ().compare(wrong, right))
          while True:
              try:
                  diff = differences[index]
              except IndexError:
                  break
      
              # diff is a string like "+ a" (meaning the character "a" was inserted)
              # extract the operation and the character
              op = diff[0]
              char = diff[-1]
      
              # if the character isn't equal in both
              # strings, increase the difference counter
              if op != ' ':
                  num_diffs += 1
      
              # if a character is wrong, there will be two operations: one
              # "+" and one "-" operation
              # we want to count this as a single mistake, not as two mistakes
              if op in '+-':
                  try:
                      next_diff = differences[index+1]
                  except IndexError:
                      pass
                  else:
                      next_op = next_diff[0]
                      if next_op in '+-' and next_op != op:
                          # skip the next operation, we don't want to count
                          # it as another mistake
                          index += 1
      
                          # we know that the character is wrong, but
                          # how wrong is it?
                          # if the only difference is the accent, it's
                          # a minor mistake
                          next_char = next_diff[-1]
                          if remove_accents(char) == remove_accents(next_char):
                              num_diffs -= 0.5
      
              index += 1
      
          # output the difference as a ratio of
          # (# of wrong characters) / (length of longest input string)
          return num_diffs / max(len(wrong), len(right))
      

      Test:

      for w, r in (('ab','ac'),
                  ('àb','ab'),
                  ('être','etre'),
                  ('très','trés'),
                  ):
          print('"{}" and "{}": {}% difference'.format(w, r, compare(w, r)*100))
      
      "ab" and "ac": 50.0% difference
      "àb" and "ab": 25.0% difference
      "être" and "etre": 12.5% difference
      "très" and "trés": 12.5% difference
      

      【Comments】: