python - calculate orthographic similarity between words of a list

【Question title】: python - calculate orthographic similarity between words of a list
【Posted】: 2018-05-20 17:23:03
【Question】:

I need to compute the orthographic similarity (edit/Levenshtein distance) between the words of a given corpus.

As Kirill suggested below, I tried the following:

import csv, itertools, Levenshtein
import numpy as np

# import the list of words from csv file
path = '/Users/my path'
file = path + 'file.csv'

with open(file, 'rb') as f:
    reader = csv.reader(f)
    wordlist = list(reader)

wordlist = np.array(wordlist) #make it a np array
wordlist2 = wordlist[:,0] #subset the first column of the imported list

for a, b in itertools.product(wordlist, wordlist):
    if a < b:
        print(a, b, Levenshtein.distance(a, b))

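However, the following error pops up:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I understand the ambiguity in the code, but can someone help me figure out how to fix it? Thanks!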

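【Comments】:

Do these examples help? rosettacode.org/wiki/Levenshtein_distance#Python
Regarding your new code: with a test.csv containing just test\n, I get wordlist = list(reader) # Error: iterator should return strings, not bytes (did you open the file in text mode?). So your error is unclear, because the stack trace is omitted. My advice is to approach the whole thing step by step: first (with a separate program) prepare a text file containing only newline-separated words, inspect the file, and then run the code from my answer with words = sorted(set(s.strip() for s in open(filename))).
Got it! Thanks!!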
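The step-by-step advice in the comment above could be sketched as follows. This is only an illustration: it assumes the words sit in the first column of the CSV, and the helper name first_column_words and the two-column sample data are hypothetical. Working with a flat list of plain strings (rather than rows of a NumPy array) means that a < b compares two strings and yields a single boolean, which is probably what the loop in the question needs.

```python
import csv
import io


def first_column_words(fileobj):
    """Return the sorted, de-duplicated first-column values of a CSV stream."""
    reader = csv.reader(fileobj)
    # Each row is a list of strings; keep only the first cell of non-empty rows.
    return sorted({row[0].strip() for row in reader if row and row[0].strip()})


# Hypothetical two-column sample standing in for the real file.
sample = io.StringIO("little,1\nMary,2\nhad,3\na,4\nlittle,5\nlamb,6\n")
words = first_column_words(sample)
print(words)  # ['Mary', 'a', 'had', 'lamb', 'little']
```

With a real file on Python 3, the stream would be opened in text mode, for example open(filename, newline=''), instead of 'rb'.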

Tags: python arrays numpy itertools levenshtein-distance

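【Solution 1】:

By definition, the Levenshtein distance is computed between exactly two strings: it is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into the other. You can compare the words pairwise, which takes n*(n-1)/2 comparisons (where n is the number of unique words in your corpus). You can do it like this: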

>>> import itertools, Levenshtein
>>> words = sorted(set('little Mary had a little lamb'.split()))
>>> for a, b in itertools.product(words, words):
...     if a < b:
...         print(a, b, Levenshtein.distance(a, b))
... 
Mary a 3
Mary had 3
Mary lamb 3
Mary little 6
a had 2
a lamb 3
a little 6
had lamb 3
had little 6
lamb little 5
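For readers without the Levenshtein package installed, the definition given above can be sketched in pure Python. This is a minimal dynamic-programming version (the textbook Wagner-Fischer recurrence, keeping only two rows), not the package's own implementation; the function name levenshtein is an assumption made for this example.

```python
def levenshtein(s, t):
    """Return the Levenshtein distance between strings s and t."""
    if len(s) < len(t):
        s, t = t, s  # the distance is symmetric; keep the stored row short
    # previous[j] is the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("lamb", "little"))     # 5
```

It runs in O(len(s) * len(t)) time per pair, so for a large corpus the compiled package is likely the better choice.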

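【Comments】:

So if I have a list of words, I should be able to use your code to compute the distance for every possible pair of words. Is that right?
@RobertP. Correct.
Hi @Kirill, I tried to do as you suggested, but I don't seem to have succeeded. Would you mind helping? See the question for more details.

【Solution 2】:

Here is the code I came up with, with Kirill's help.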

import csv
import itertools
import os

import Levenshtein

# Open the newline-separated list of words
path = '/Users/your path'
file = os.path.join(path, 'wordlists.txt')
output = os.path.join(path, 'ortho_similarities.txt')
with open(file) as f:
    words = sorted(set(s.strip() for s in f))

# Take every pairwise combination of the words in the list,
# compute the Levenshtein distance (LD) for each pair,
# and write the results to a CSV file (Python 3: text mode, newline='')
with open(output, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    for a, b in itertools.product(words, words):
        if a < b:
            writer.writerow([a, b, Levenshtein.distance(a, b)])
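The loop above walks all n*n ordered pairs and keeps fewer than half of them. Because the word list is sorted and unique, itertools.combinations should yield the same n*(n-1)/2 pairs directly. The sketch below is one way to write that; the names write_pairwise and length_gap are hypothetical, and length_gap is only a placeholder metric so that the example runs without the Levenshtein package. In practice Levenshtein.distance would be passed instead.

```python
import csv
import io
import itertools


def write_pairwise(words, distance, out):
    """Write one CSV row (a, b, distance(a, b)) per unordered pair of words.

    Returns the number of rows written.
    """
    unique = sorted(set(words))
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    count = 0
    # combinations() on a sorted, unique list yields each pair once, with a < b.
    for a, b in itertools.combinations(unique, 2):
        writer.writerow([a, b, distance(a, b)])
        count += 1
    return count


def length_gap(a, b):
    # Placeholder metric for this demo only; it is NOT an edit distance.
    return abs(len(a) - len(b))


buffer = io.StringIO()  # stands in for a file opened with open(output, 'w', newline='')
n_rows = write_pairwise('little Mary had a little lamb'.split(), length_gap, buffer)
print(n_rows)  # 10, i.e. 5 * 4 / 2 pairs for 5 unique words
```

Passing the metric as an argument keeps the pairing and CSV logic independent of whichever distance function is used.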

【Comments】: