【Question Title】: Calculate cosine similarity between all cases in a dataframe fast
【Posted】: 2021-04-02 14:32:07
【Question】:

I am working on an NLP project where I have to compare the similarity between many sentences, e.g. starting from this dataframe:

The first thing I tried was joining the dataframe with itself to get the following format and comparing row by row:

For medium/large datasets I quickly run out of memory; for example, joining 10k rows against itself produces 100M rows, which I cannot fit in RAM.

My current approach is to iterate over the dataframe:

final = pd.DataFrame()

### for each row
for i in range(len(df_sample)):

    ### select the corresponding vector to compare with
    v = df_sample.loc[i, "use_vector"]
    ### compare all cases against the selected vector (compute once, reuse below)
    sims = df_sample.apply(lambda x: cosine_similarity_numba(x.use_vector, v), axis=1)

    ### keep the cases with a similarity over a given threshold, in this case 0.6
    temp = df_sample[sims > 0.6]
    ### filter out the base case
    temp = temp[temp.index != i]
    temp["original_question"] = df_sample.loc[i, "questions"]
    ### append the result
    final = pd.concat([final, temp])

But this approach is not very fast either. How can I improve the performance of this process?

【Question Discussion】:

    Tags: python pandas numpy nlp linear-algebra


    【Solution 1】:

    One possible trick you may employ is to switch from a sparse tf-idf representation to dense word embeddings from Facebook's fasttext:

    import fasttext
    # wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
    model = fasttext.load_model("./cc.en.300.bin")
    

    Then you can compute cosine similarities with the dense sentence embeddings, which are more space-efficient, context-aware, and better-performing (?):

    df = pd.DataFrame({"questions":["This is a question",
                                    "This is a similar questin",
                                    "And this one is absolutely different"]})
    
    df["vecs"] = df["questions"].apply(model.get_sentence_vector)
    
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    # only pairwise distance with itself
    # vectorized, no doubling data
    # note: pdist with metric="cosine" returns cosine *distance* (1 - similarity)
    out = pdist(np.stack(df['vecs']), metric="cosine")
    cosine_distance = squareform(out)
    print(cosine_distance)
    

    [[0.         0.08294727 0.25305626]
     [0.08294727 0.         0.23575631]
     [0.25305626 0.23575631 0.        ]]
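    Since the matrix above holds cosine distances, recovering the question's original goal (keep pairs with similarity over 0.6) takes one conversion and a boolean filter. A minimal sketch with made-up vectors standing in for the embeddings:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

vecs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],   # nearly parallel to the first vector
                 [0.0, 0.0, 1.0]])  # orthogonal to both

# cosine distance -> cosine similarity
sim = 1.0 - squareform(pdist(vecs, metric="cosine"))
np.fill_diagonal(sim, -1.0)         # exclude self-matches
rows, cols = np.nonzero(sim > 0.6)  # index pairs over the threshold
pairs = [(int(i), int(j)) for i, j in zip(rows, cols)]
print(pairs)  # [(0, 1), (1, 0)]
```

    Each pair appears twice ((i, j) and (j, i)) because the similarity matrix is symmetric; keeping only `i < j` removes the duplicates.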
    

    Also note that, on top of the memory efficiency, you get roughly a 10x speed-up thanks to using scipy's vectorized cosine distance.

    Another possible trick is to cast your similarity vectors from the default float64 to float32 or float16:

    df["vecs"] = df["vecs"].apply(np.float16)
    

    which will give you both speed and memory gains.
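    To illustrate the memory side of that trade-off, a toy sketch with random arrays standing in for the embeddings (float16 halves the precision, so it is worth verifying that the similarity rankings still hold on your data):

```python
import numpy as np

vecs64 = np.random.rand(10_000, 300)    # default float64: 8 bytes per value
vecs16 = vecs64.astype(np.float16)      # float16: 2 bytes per value

print(vecs64.nbytes)  # 24000000
print(vecs16.nbytes)  # 6000000
```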

    【Discussion】:

      【Solution 2】:

      Yesterday I wrote an answer to a question similar to yours: Top-K Cosine Similarity rows in a dataframe of pandas.

      import numpy as np
      import pandas as pd
      from sklearn.metrics.pairwise import cosine_similarity
      
      data = {"use_vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
      df = pd.DataFrame(data)
      print("Data: \n{}\n".format(df))
      
      A = np.array(df["use_vector"].tolist())
      vectors_num = len(A)
      # Get similarities matrix, value for each pair at corresponding index
      # of upper triangle of matrix
      similarities = cosine_similarity(A)
      # Set symmetrical (repetitive) and diagonal (similarity to self) entries to -2
      similarities[np.tril_indices(vectors_num)] = -2
      print("Similarities: \n{}\n".format(similarities))

      Output:

      Data: 
                use_vector
      0  [-0.1, -0.2, 0.3]
      1  [0.1, -0.2, -0.3]
      2  [-0.1, 0.2, -0.3]
      
      Similarities:
      [[-2.         -0.42857143 -0.85714286]  # vector 0 & 1, 2
       [-2.         -2.          0.28571429]  # vector 1 & 2
       [-2.         -2.         -2.        ]]
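      For the 10k-row case from the question, the full n×n matrix is 100M float64 values (~800 MB), which may still be too large. One way to bound memory is to process the rows in chunks against the whole set, keeping only pairs above the threshold as you go. This is a sketch of that idea, not part of the answer above; the function name and `chunk_size` parameter are made up for illustration:

```python
import numpy as np

def similar_pairs(vecs, threshold=0.6, chunk_size=1024):
    """Yield (i, j, similarity) for all pairs above the threshold."""
    # Normalize once so cosine similarity reduces to a dot product
    X = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    n = X.shape[0]
    for start in range(0, n, chunk_size):
        # (chunk_size, n) block of similarities -- memory stays bounded
        block = X[start:start + chunk_size] @ X.T
        rows, cols = np.nonzero(block > threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i != c:  # skip self-matches on the diagonal
                yield int(i), int(c), float(block[r, c])

# toy usage: the first two vectors are nearly parallel
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = list(similar_pairs(vecs, threshold=0.6, chunk_size=2))
```

      Tuning `chunk_size` trades peak memory against the number of matrix-multiply calls; each block only ever holds chunk_size × n values.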
      

      【Discussion】:
