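【Question title】: Cosine similarity between rows of a pandas DataFrame
【Posted】: 2021-03-31 18:21:13
【Question】:

I have a CSV file with the contents shown below, and I want to compute the cosine similarity between one ID and the remaining IDs in the file.

I loaded it into a pandas DataFrame and computed the similarity matrix as follows: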

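    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Convert each Vector entry (e.g. the string "[x y z]") into a flat NumPy array
    old_df['Vector'] = old_df['Vector'].apply(lambda v: np.array(np.matrix(v)).ravel())
    # Stack the per-row arrays into an (n_rows, n_dims) matrix
    A = np.array(old_df['Vector'].tolist())
    # similarities[i, j] is the cosine similarity between rows i and j
    similarities = cosine_similarity(A)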
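
As a side note, the `Vector` values in the table look like bracketed, whitespace-separated numbers. A minimal sketch of parsing such strings without `np.matrix`, assuming that exact format (`parse_vector` is a hypothetical helper name, not part of the original post):

```python
import numpy as np

def parse_vector(text):
    """Parse a string like '[ 0.1 -0.2  0.3]' into a 1-D float array."""
    # Drop surrounding whitespace and brackets, then split on any whitespace
    return np.array(text.strip().strip("[]").split(), dtype=float)

# Two sample strings copied from the table in the question
raw = [
    "[-0.04870541 -0.02133574  0.03180726]",
    "[  0.0732905  -0.05331331  0.06378368]",
]
A = np.vstack([parse_vector(s) for s in raw])
print(A.shape)
```

With a DataFrame, the same helper could be applied as `old_df['Vector'].apply(parse_vector)`, provided every row has the same number of components.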

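The output looks fine. However, I don't know how to find the GUIDs (or IDs) that are most similar to a given GUID (or ID); I only want the top k with the highest similarity scores.

Could you help me with this?

Thanks.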

|Index  |  GUID | Vector                                |
|-------|-------|---------------------------------------|
|36099  | b770  |[-0.04870541 -0.02133574  0.03180726]  |
|36098  | 808f  |[  0.0732905  -0.05331331  0.06378368] |
|36097  | b111  |[ 0.01994788  0.00417582 -0.09615131]  |
|36096  | b6b5  |[0.025697   -0.08277534 -0.0124591]    |
|36083  | 9b07  |[ 0.025697   -0.08277534 -0.0124591]   |
|36082  | b9ed  |[-0.00952298  0.06188576 -0.02636449]  |
|36081  | a5b6  |[0.00432161  0.02264584 -0.0341924]    |
|36080  | 9891  |[ 0.08732156  0.00649456 -0.02014138]  |
|36079  | ba40  |[0.05407356 -0.09085554 -0.07671648]   |
|36078  | 9dff  |[-0.09859556  0.04498474 -0.01839088]  |
|36077  | a423  |[-0.06124249  0.06774347 -0.05234318]  |
|36076  | 81c4  |[0.07278682 -0.10460124 -0.06572364]   |
|36075  | 9f88  |[0.09830415  0.05489364 -0.03916228]   |
|36074  | adb8  |[0.03149953 -0.00486591  0.01380711]   |
|36073  | 9765  |[0.00673934  0.0513557  -0.09584251]   |
|36072  | aff4  |[-0.00097896  0.0022945   0.01643319]  |

【Comments】:

    Tags: python-3.x pandas dataframe cosine-similarity


    【Solution 1】:

    Sample code that gets the top k cosine similarities together with the corresponding GUID pairs and row indices:

    import numpy as np
    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity
    
    data = {"GUID": ["b770", "808f", "b111"], "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
    df = pd.DataFrame(data)
    print("Data: \n{}\n".format(df))
    
    # Stack the per-row vectors into an (n, d) array
    A = np.array(df['Vector'].tolist())
    vectors_num = len(A)
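    # Pairwise cosine similarity matrix; entry [i, j] compares rows i and j
    similarities = cosine_similarity(A)
    # Mask the diagonal and lower triangle with -2 (below the minimum cosine of -1)
    # so that self-matches and duplicate (j, i) pairs never reach the top k
    similarities[np.tril_indices(vectors_num)] = -2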
    print("Similarities: \n{}\n".format(similarities))
    
    k = 2
    # Only n * (n - 1) / 2 distinct pairs exist; cap k so no masked entry is returned
    k = min(k, vectors_num * (vectors_num - 1) // 2)
    # Get top k similarities and pair GUID in ascending order
    top_k_indexes = np.unravel_index(np.argsort(similarities.ravel())[-k:], similarities.shape)
    top_k_similarities = similarities[top_k_indexes]
    top_k_pair_GUID = []
    # zip the row and column index arrays so each iteration yields one (i, j) pair
    for i, j in zip(*top_k_indexes):
        pair_GUID = (df.iloc[i]["GUID"], df.iloc[j]["GUID"])
        top_k_pair_GUID.append(pair_GUID)
    
    print("top_k_indexes: \n{}\ntop_k_pair_GUID: \n{}\ntop_k_similarities: \n{}".format(top_k_indexes, top_k_pair_GUID, top_k_similarities))
    

    Output:

    Data:
       GUID             Vector
    0  b770  [-0.1, -0.2, 0.3]
    1  808f  [0.1, -0.2, -0.3]
    2  b111  [-0.1, 0.2, -0.3]
    
    Similarities:
    [[-2.         -0.42857143 -0.85714286] 
     [-2.         -2.          0.28571429] 
     [-2.         -2.         -2.        ]]
    
    top_k_indexes:
    (array([0, 1], dtype=int64), array([1, 2], dtype=int64))
    top_k_pair_GUID:
    [('b770', '808f'), ('808f', 'b111')]
    top_k_similarities:
    [-0.42857143  0.28571429]
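
The code above ranks pairs across the whole table. The question also asks for the GUIDs closest to one particular GUID, which could be sketched as follows. This is a NumPy-only illustration, not the answerer's method: `top_k_similar` is a hypothetical helper, `guids` is assumed to be a plain list, and no vector is assumed to be all zeros.

```python
import numpy as np

def top_k_similar(guids, vectors, query_guid, k=3):
    """Return up to k (guid, score) pairs most similar to query_guid, best first."""
    X = np.asarray(vectors, dtype=float)
    # Normalise each row so a dot product equals the cosine similarity
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    q = guids.index(query_guid)
    scores = X @ X[q]
    # Exclude the query row itself from the ranking
    scores[q] = -np.inf
    # At most len(guids) - 1 other rows exist
    k = min(k, len(guids) - 1)
    order = np.argsort(scores)[::-1][:k]
    return [(guids[i], float(scores[i])) for i in order]

guids = ["b770", "808f", "b111"]
vectors = [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]
result = top_k_similar(guids, vectors, "808f", k=2)
print(result)
```

With a DataFrame, `guids` and `vectors` could be taken from `df["GUID"].tolist()` and `df["Vector"].tolist()`. For the sample data, `808f` ranks `b111` first and `b770` second, consistent with the similarity matrix printed above.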
    

    【Comments】:

    • @TranTam Glad I could help :)