First, you should compare the strings using the Levenshtein distance. I found an implementation at this link, Levenshtein Distance Algorithm for Python:
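# Define Levenshtein distance function (from the mentioned link)
def levenshtein(s1, s2):
    # Make sure s1 is the longer string so each row has len(s2) + 1 entries
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    # Turning a string into the empty string takes len(s1) deletions
    if len(s2) == 0:
        return len(s1)
    # Row 0: distance from the empty prefix of s1 to each prefix of s2
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            # (c1 != c2) is 0 when the characters match, 1 otherwise
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    # The last cell holds the distance between the full strings
    return previous_row[-1]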
Once you have that, write a function that finds the closest match between a given string and the list of correctly spelled names.
names_list = ['bercelona', 'emstrdam', 'Praga']
good_names = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
# Define a function that returns the best match
def get_closest_match(name, real_names):
    levdist = [levenshtein(name, real_name) for real_name in real_names]
    # Return the first name with the smallest distance (ties go to the earliest entry)
    return real_names[levdist.index(min(levdist))]
# Apply it to every name in the first list
final_list = [get_closest_match(name, good_names) for name in names_list]
Finally, you just need to loop over the first list with this function. The result is:
>>> print(final_list)
['Barcelona', 'Amsterdam', 'Prague']
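One caveat: the function above always returns some name, even when the input looks nothing like any entry in the list. Below is a minimal sketch of one way to guard against that, assuming a case-insensitive comparison and a distance cutoff are acceptable for your data. The names `get_closest_match_or_none` and `max_distance`, the cutoff value of 2, and the extra input 'Tokyo' are my own choices for illustration and are not part of the linked code.

```python
def levenshtein(s1, s2):
    """Edit distance, computed row by row as in the function above."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            current_row.append(min(previous_row[j + 1] + 1,         # insertion
                                   current_row[j] + 1,              # deletion
                                   previous_row[j] + (c1 != c2)))   # substitution
        previous_row = current_row
    return previous_row[-1]


def get_closest_match_or_none(name, real_names, max_distance=2):
    """Hypothetical helper: closest name ignoring case, or None if too far away."""
    def distance(real_name):
        return levenshtein(name.lower(), real_name.lower())

    best = min(real_names, key=distance)  # first entry with the smallest distance
    # Assumption: anything further than max_distance edits is treated as "no match"
    return best if distance(best) <= max_distance else None


good_names = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
results = [get_closest_match_or_none(name, good_names)
           for name in ['bercelona', 'emstrdam', 'Praga', 'Tokyo']]
print(results)  # ['Barcelona', 'Amsterdam', 'Prague', None]
```

A sensible cutoff depends on how long your names are and how noisy the input is, so treat the value 2 as a starting point to tune and not as a recommendation.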