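There are many different string distance metrics. I'm not sure how cosine similarity would be used in this case, but I suggest taking a look at the strsim library.
I'll give an example of how to solve the problem with the Jaro-Winkler metric, which works best for short strings.
I've also included my attempt at using cosine similarity, based on the examples in that library's documentation.
It may be completely wrong, but it should give you a general idea of how to build a dataframe from the Cartesian product of two columns of different lengths, and how to apply the strsim algorithms to the data.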
Data preparation:
import pandas as pd
from similarity.jarowinkler import JaroWinkler
from similarity.cosine import Cosine

df1 = pd.DataFrame({
    "name": ["mahesh", "suresh"]
})
df2 = pd.DataFrame({
    "name": ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]
})

# Cartesian product of the two name columns: one row per (col1, col2) pair
df = pd.MultiIndex.from_product(
    [df1["name"], df2["name"]], names=["col1", "col2"]
).to_frame(index=False)
Returns:
col1 col2
0 mahesh mahesh
1 mahesh surendra
2 mahesh shrivatsa
3 mahesh suresh
4 mahesh maheshwari
5 suresh mahesh
6 suresh surendra
7 suresh shrivatsa
8 suresh suresh
9 suresh maheshwari
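The pairing step above is just a Cartesian product, so the same ten rows can be sketched without pandas. This is a minimal illustration using the standard library's itertools.product; the variable names names1, names2 and pairs are mine and do not appear in the code above.

```python
from itertools import product

names1 = ["mahesh", "suresh"]
names2 = ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]

# Every (col1, col2) combination, in the same order as the dataframe rows above.
pairs = list(product(names1, names2))

for col1, col2 in pairs:
    print(col1, col2)
```

The dataframe version is still the more convenient one here, because the similarity scores can then be attached as extra columns.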
Jaro-Winkler:
jarowinkler = JaroWinkler()
df["jarowinkler_sim"] = [
    jarowinkler.similarity(i, j) for i, j in zip(df["col1"], df["col2"])
]
Returns:
col1 col2 jarowinkler_sim
0 mahesh mahesh 1.0
1 mahesh surendra 0.4305555555555555
2 mahesh shrivatsa 0.5185185185185185
3 mahesh suresh 0.6666666666666666
4 mahesh maheshwari 0.9466666666666667
5 suresh mahesh 0.6666666666666666
6 suresh surendra 0.8333333333333334
7 suresh shrivatsa 0.611111111111111
8 suresh suresh 1.0
9 suresh maheshwari 0.48888888888888893
Cosine similarity:
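cosine = Cosine(2)
# Build a profile for each string, then compare the profiles pairwise
df["p0"] = df["col1"].apply(lambda s: cosine.get_profile(s))
df["p1"] = df["col2"].apply(lambda s: cosine.get_profile(s))
df["cosine_sim"] = [
    cosine.similarity_profiles(p0, p1) for p0, p1 in zip(df["p0"], df["p1"])
]
# Display without the intermediate profile columns (returns a new frame; df itself is unchanged)
df.drop(["p0", "p1"], axis=1)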
Returns:
col1 col2 cosine_sim
0 mahesh mahesh 0.9999999999999998
1 mahesh surendra 0.0
2 mahesh shrivatsa 0.15811388300841897
3 mahesh suresh 0.3999999999999999
4 mahesh maheshwari 0.7453559924999299
5 suresh mahesh 0.3999999999999999
6 suresh surendra 0.5070925528371099
7 suresh shrivatsa 0.15811388300841897
8 suresh suresh 0.9999999999999998
9 suresh maheshwari 0.29814239699997197
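To make the cosine numbers less opaque, here is a sketch of what I assume Cosine(2) computes: each string becomes a count vector of its overlapping 2-character substrings, and the score is the cosine of the angle between the two vectors. The helper names bigram_profile and cosine_similarity are hypothetical and not part of strsim. The sketch agrees with the values printed above for these inputs, but I have not checked it against the library's source.

```python
from collections import Counter
from math import sqrt


def bigram_profile(s, k=2):
    """Count the overlapping k-character substrings of s (hypothetical helper)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def cosine_similarity(a, b, k=2):
    """Cosine of the angle between the k-gram count vectors of a and b."""
    pa, pb = bigram_profile(a, k), bigram_profile(b, k)
    # Dot product over the k-grams of a; a Counter returns 0 for missing keys.
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm_a = sqrt(sum(c * c for c in pa.values()))
    norm_b = sqrt(sum(c * c for c in pb.values()))
    norm = norm_a * norm_b
    # Strings shorter than k have empty profiles; treat that as no similarity.
    return dot / norm if norm else 0.0


# "mahesh" and "suresh" each have 5 bigrams and share "es" and "sh",
# so the score is 2 / 5, up to floating-point rounding.
print(cosine_similarity("mahesh", "suresh"))
```

This also explains why "mahesh" against "surendra" scores 0.0: the two strings have no 2-character substring in common, even though they share individual letters.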