Answer: text similarity between two dataframes of unequal size

【Question title】: Text similarity between two dataframes of unequal size
【Posted】: 2021-05-15 12:09:27
【Question】:

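I have two dataframes, each containing an id and text embeddings, and I want to check similarity across the two dataframes. data1 (2,000 rows) is much shorter than data2 (0.5 million rows).

I want the similarity between every row of data1 and all rows of data2: row 1 of data1 against all rows of data2, row 2 of data1 against all rows of data2, and so on.

For each iteration, I want to store the best match and its ID in two columns.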

data1
ID_1, title_embeddings
1, 'dbhbhbc jcdwc dnwc, 0.5, 0.6, 0.8, 0.8
2, 'hbwdbhbc jcdwc dnwc, 0.15, 0.65, 0.85, 0.348
..

data2
ID_2, text1, tweet_embeddings
1, 'dbhbc jcdwc dnwc, 0.5, 0.6, 0.8, 0.8
2, 'dbhbc jcdwc dnwc, 0.15, 0.65, 0.85, 0.348
3, 'dbhbnec jcdwc dnwc, 0.565, 0.346, 0.28, 0.18
4, 'dbhbc jcdwc dnwc, 0.165, 0.365, 0.785, 0.348

from scipy import spatial

X = data2['tweet_embeddings']   # embeddings of the large frame
Y = data1.head()                # first few rows of the small frame, for testing

for i, row in Y.iterrows():
    print('number ' + str(i))
    sim_score = []
    for j in range(len(X)):
        # cosine similarity = 1 - cosine distance
        a = 1 - spatial.distance.cosine(row['title_embeddings'], X.iloc[j])
        sim_score.append(a)
    # report the best score once per data1 row, after the inner loop
    print(max(sim_score))

Expected output
ID_1, ID_2, tweet_embeddings, sim_score
1      4    'dbhbc jcdwc dnwc, 0.5

So far, my approach does not produce this result.

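【Comments】:

Tags: python text nlp

【Solution 1】:

The expected output first requires the similarity scores, which can be computed as follows. However, in data1, is the third row an idf score?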

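import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel

# Both embedding columns need to be in matrix form (one row per record).
# linear_kernel is a plain dot product; it equals cosine similarity only
# when the embeddings are L2-normalized.
title_mat = np.vstack(data1['title_embeddings'].to_list())  # 2,000 rows of data1
tweet_mat = np.vstack(data2['tweet_embeddings'].to_list())  # 0.5 million rows of data2

df_cos_sim = pd.DataFrame(index=data2.index)

# one column per data1 row, scored against every row of data2
for i in range(len(title_mat)):
    current_cosine_sim = linear_kernel(title_mat[i:i+1], tweet_mat).flatten()
    df_cos_sim[i] = current_cosine_sim

df_cos_sim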
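
If only the best match per data1 row is needed, as in the expected output of the question, the full score table may not have to be stored at all. The following is a minimal sketch, not the answerer's method: it assumes each cell of `title_embeddings` and `tweet_embeddings` holds a list or array of floats of the same length, and that no embedding is all zeros (a zero vector would yield NaN scores). The function name `best_matches` and the `chunk_size` parameter are made up for illustration, and the two small frames at the end are invented data that only demonstrate the call.

```python
import numpy as np
import pandas as pd


def best_matches(data1, data2, chunk_size=50_000):
    """For each row of data1, find the most cosine-similar row of data2."""
    # Stack the per-row embeddings into 2-D float arrays (assumes every
    # cell holds a list/array of floats of the same length).
    a = np.vstack(data1["title_embeddings"].to_list()).astype(float)
    b = np.vstack(data2["tweet_embeddings"].to_list()).astype(float)

    # L2-normalize so that a plain dot product equals cosine similarity.
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)

    best_score = np.full(len(a), -np.inf)
    best_pos = np.zeros(len(a), dtype=int)

    # Walk through data2 in chunks so the full len(data1) x len(data2)
    # score matrix never has to sit in memory at once.
    for start in range(0, len(b), chunk_size):
        sims = a @ b[start:start + chunk_size].T   # shape (len(a), chunk)
        pos = sims.argmax(axis=1)                  # best column per row
        score = sims[np.arange(len(a)), pos]
        better = score > best_score
        best_score[better] = score[better]
        best_pos[better] = pos[better] + start     # position within data2

    return pd.DataFrame({
        "ID_1": data1["ID_1"].to_numpy(),
        "ID_2": data2["ID_2"].to_numpy()[best_pos],
        "sim_score": best_score,
    })


# Tiny invented frames, only to show the call.
data1 = pd.DataFrame({
    "ID_1": [1, 2],
    "title_embeddings": [[1.0, 0.0], [0.0, 1.0]],
})
data2 = pd.DataFrame({
    "ID_2": [1, 2, 3],
    "tweet_embeddings": [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]],
})
result = best_matches(data1, data2, chunk_size=2)
print(result)
```

The chunk size trades memory for speed: with 2,000 rows in data1, a chunk of 50,000 rows of data2 produces a 2,000 x 50,000 block of float64 scores, roughly 800 MB, so a smaller value may be more comfortable on a modest machine. The text column of data2 can be attached afterwards by merging the result with data2 on `ID_2`.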

【Comments】: