【Posted】: 2020-12-10 06:06:01
【Question】:
This one is a bit tricky for me.
DataFrame:
  parent  children
0 MAX     [MAX, amx, akd]
1 Sam     ['Sam','sammy','samsam']
2 Larry   ['lar','lair','larrylamo']
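I have a function that works when I pass in strings directly: it compares two strings and returns a score describing how close they are (a distance), similar to the Levenshtein distance.
How do I run this function over a DataFrame? I need to compare each record in the first column ("parent") against the corresponding list in the second column ("children").
Currently I can run it on its own and get the following result: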
>>> reference = 'larry'
>>> value_list = ['lar','lair','larrylamo']
>>> get_top_matches(reference,value_list)
[('lar',0.91),('larrylamo',0.91),('lair',0.83)]
I am trying to create a third column that holds, for each row, the list of (match, score) tuples, like this:
  parent  children                    func_results
0 MAX     [MAX, amx, akd]             [('MAX',1.0),('amx',0.89),('akd',0.56)]
1 Sam     ['Sam','sammy','samsam']    [('Sam',1.0),('sammy',0.91),('samsam',0.88)]
2 Larry   ['lar','lair','larrylamo']  [('lar',0.91),('larrylamo',0.91),('lair',0.83)]
I think the function should work as is, if I can figure out how to apply it in a for loop over the df.
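The row-wise loop described here might look like the sketch below. It assumes pandas is installed and that the columns are named `parent` and `children` as in the sample data. `toy_top_matches` and `add_match_column` are hypothetical names introduced for illustration. `toy_top_matches` is a stand-in built on the standard library's `difflib.SequenceMatcher` so that the block runs on its own, and its scores will differ from the Jaro-Winkler values quoted in this question.

```python
from difflib import SequenceMatcher

import pandas as pd


def toy_top_matches(reference, value_list):
    """Hypothetical stand-in scorer (NOT the Jaro-Winkler function from the
    question): returns (value, score) tuples sorted best first."""
    scores = [
        (val, round(SequenceMatcher(None, reference.lower(), val.lower()).ratio(), 2))
        for val in value_list
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores


def add_match_column(df, scorer, out_col="func_results"):
    """Return a copy of df with one list of (child, score) tuples per row."""
    result = df.copy()
    # zip walks the two columns in lockstep, so each parent is compared
    # only with the children list on its own row.
    result[out_col] = [
        scorer(parent, children)
        for parent, children in zip(result["parent"], result["children"])
    ]
    return result


# Sample data copied from the question
df = pd.DataFrame({
    "parent": ["MAX", "Sam", "Larry"],
    "children": [
        ["MAX", "amx", "akd"],
        ["Sam", "sammy", "samsam"],
        ["lar", "lair", "larrylamo"],
    ],
})

scored = add_match_column(df, toy_top_matches)
print(scored)
```

Because the scorer is passed in as an argument, the question's own function could be used with `add_match_column(df, get_top_matches)`. A list comprehension over `zip` is used rather than `DataFrame.apply(axis=1)`; both should work here, and the comprehension simply avoids building a Series for every row.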
Here is the function:
import math
import re


def sort_token_alphabetically(word):
    token = re.split('[,. ]', word)
    sorted_token = sorted(token)
    return ' '.join(sorted_token)


def get_jaro_distance(first, second, winkler=True, winkler_ajustment=True,
                      scaling=0.1, sort_tokens=True):
    # Validate before tokenizing; re.split would raise TypeError on None
    if not first or not second:
        raise JaroDistanceException(
            "Cannot calculate distance from NoneType ({0}, {1})".format(
                first.__class__.__name__,
                second.__class__.__name__))

    if sort_tokens:
        first = sort_token_alphabetically(first)
        second = sort_token_alphabetically(second)

    jaro = _score(first, second)
    cl = min(len(_get_prefix(first, second)), 4)

    if all([winkler, winkler_ajustment]):  # 0.1 as scaling factor
        return round((jaro + (scaling * cl * (1.0 - jaro))) * 100.0) / 100.0

    return jaro


def _score(first, second):
    shorter, longer = first.lower(), second.lower()

    if len(first) > len(second):
        longer, shorter = shorter, longer

    m1 = _get_matching_characters(shorter, longer)
    m2 = _get_matching_characters(longer, shorter)

    if len(m1) == 0 or len(m2) == 0:
        return 0.0

    return (float(len(m1)) / len(shorter) +
            float(len(m2)) / len(longer) +
            float(len(m1) - _transpositions(m1, m2)) / len(m1)) / 3.0


def _get_diff_index(first, second):
    if first == second:
        return -1  # sentinel for identical strings; _get_prefix checks for it

    if not first or not second:
        return 0

    max_len = min(len(first), len(second))
    for i in range(0, max_len):
        if not first[i] == second[i]:
            return i

    return max_len


def _get_prefix(first, second):
    if not first or not second:
        return ""

    index = _get_diff_index(first, second)
    if index == -1:
        return first
    elif index == 0:
        return ""
    else:
        return first[0:index]


def _get_matching_characters(first, second):
    common = []
    limit = math.floor(min(len(first), len(second)) / 2)

    for i, l in enumerate(first):
        left, right = int(max(0, i - limit)), int(
            min(i + limit + 1, len(second)))
        if l in second[left:right]:
            common.append(l)
            second = second[0:second.index(l)] + '*' + second[
                second.index(l) + 1:]

    return ''.join(common)


def _transpositions(first, second):
    return math.floor(
        len([(f, s) for f, s in zip(first, second) if not f == s]) / 2.0)


def get_top_matches(reference, value_list, max_results=None):
    scores = []
    if not max_results:
        max_results = len(value_list)
    for val in value_list:
        score_sorted = get_jaro_distance(reference, val)
        score_unsorted = get_jaro_distance(reference, val, sort_tokens=False)
        scores.append((val, max(score_sorted, score_unsorted)))
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores[:max_results]


class JaroDistanceException(Exception):
    def __init__(self, message):
        super().__init__(message)


reference = 'larry'
value_list = ['lar','lair','larrylamo']
get_top_matches(reference, value_list)
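For anyone checking the scores by hand, the arithmetic in `_score` and `get_jaro_distance` above can be written out as follows. The symbols are labels introduced here, not names from the code: $s$ and $l$ are the shorter and longer lowercased strings, $m_1$ and $m_2$ are the matched-character strings returned by `_get_matching_characters` in each direction, $t$ is the value returned by `_transpositions`, $p$ is `scaling`, and $c$ is the common-prefix length capped at 4 (`cl`). $J$ is taken as 0 when either $m_1$ or $m_2$ is empty.

```latex
\[
J = \frac{1}{3}\left(\frac{|m_1|}{|s|} + \frac{|m_2|}{|l|} + \frac{|m_1| - t}{|m_1|}\right),
\qquad
t = \left\lfloor \frac{\#\{\, i : m_1[i] \neq m_2[i] \,\}}{2} \right\rfloor
\]
\[
JW = J + p \cdot c \cdot (1 - J)
\]
% Worked check for reference 'larry' against 'lar':
% |s| = 3, |l| = 5, |m_1| = |m_2| = 3, t = 0, c = 3, p = 0.1
\[
J = \frac{1}{3}\left(\frac{3}{3} + \frac{3}{5} + \frac{3}{3}\right) \approx 0.867,
\qquad
JW \approx 0.867 + 0.1 \cdot 3 \cdot 0.133 \approx 0.907
\]
```

After the two-decimal rounding in `get_jaro_distance`, this gives 0.91, which agrees with the `('lar',0.91)` entry in the output shown earlier.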
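【Discussion】:
-
What is your function, what have you tried (with code), and what was the result? We ask that questions on this site include a minimal reproducible example that can be run and tested.
-
What does your comparison function look like?
-
@G.Anderson I have added an executable version of the function.
-
@SeyiDaniel Just added the function to the question. Please refresh.
Tags: python pandas fuzzy-comparison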