Correlation of a dataframe with itself: answers

【Question title】: Correlation of dataframe with itself
【Posted】: 2021-10-19 13:38:32
【Question description】:

I have a dataframe that looks like this:

import pandas as pd

# name1..name4 and text1..text4 are placeholders for the actual strings
a = pd.DataFrame({'names': [name1, name2, name3, name4],
                  'texts': [text1, text2, text3, text4]})

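I have implemented a function that uses GloVe embeddings to compute the cosine similarity between two texts, based on the words in each text.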

import numpy as np
from scipy.spatial import distance

def cosine_distance_wordembedding_method(s1, s2):
    # glove (word -> vector lookup) and preprocess (tokenizer) are not shown here
    vector_1 = np.mean([glove[word] if word in glove else 0 for word in preprocess(s1)], axis=0)
    vector_2 = np.mean([glove[word] if word in glove else 0 for word in preprocess(s2)], axis=0)
    # distance.cosine returns the cosine distance; 1 - distance is the similarity
    cosine = distance.cosine(vector_1, vector_2)
    return 1 - cosine
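One caveat with the function above: the `else 0` fallback mixes a scalar with embedding vectors, and newer NumPy releases (1.24 and later, as far as I know) refuse to build an array from such a ragged list. Below is a minimal sketch of a more defensive averaging helper, assuming `glove` behaves like a dict from word to fixed-length vector. The toy `glove` table, the whitespace `preprocess`, the constant `DIM`, and the names `text_vector` and `cosine_similarity` are all hypothetical stand-ins, not the asker's actual objects.

```python
import numpy as np

# Hypothetical toy embedding table standing in for the real GloVe vectors.
glove = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.8, 0.6, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}
DIM = 3  # embedding dimension of the toy table (assumption)


def preprocess(text):
    # Hypothetical stand-in for the real preprocessing: lowercase, split on whitespace.
    return text.lower().split()


def text_vector(text):
    """Average the word vectors of `text`; unknown words count as zero vectors."""
    words = preprocess(text)
    if not words:
        return np.zeros(DIM)
    rows = [glove.get(word, np.zeros(DIM)) for word in words]
    return np.mean(rows, axis=0)


def cosine_similarity(s1, s2):
    v1, v2 = text_vector(s1), text_vector(s2)
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    # Cosine similarity is undefined for an all-zero vector; return 0.0 by convention here.
    return float(v1 @ v2 / norm) if norm else 0.0
```

Using an explicit zero vector keeps every row the same length, so the average still treats unknown words as contributing nothing, which appears to be what the original `else 0` intended.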

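Now I want to apply this function to every row of my dataframe, comparing the texts column with itself. The resulting dataframe should look like this (i.e. a correlation matrix):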

       name1  name2  name3  name4
name1  1.00   0.95   0.79   0.40
name2  0.95   1.00   0.85   0.65
name3  0.79   0.85   1.00   0.79
name4  0.66   0.65   0.79   1.00

I have tried two ways of doing this, and both are slow. I would like to know whether there is a faster alternative.

First approach:

df = a.texts.apply(lambda text1: a.texts.apply(lambda text2: cosine_distance_wordembedding_method(text1, text2)))

Second approach:

# Create a dataframe to store output.
df = pd.DataFrame(index=a.index, columns = a.index)

# Compute the similarities
for index1, row1 in a.iterrows():
    for index2, row2 in a.iterrows():
        df.loc[index1, index2] = cosine_distance_wordembedding_method(row1["texts"], row2["texts"])

【Question discussion】:

Tags: python pandas correlation

【Solution 1】:

scipy.spatial.distance.cdist is vectorized, so you can compute the vector representation of every text first and then call cdist once:

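import numpy as np
from scipy.spatial.distance import cdist

# One averaged embedding per text (glove and preprocess as in the question)
vectors = [np.mean([glove[word] if word in glove else 0 for word in preprocess(s1)], axis=0)
           for s1 in a['texts']]

# cdist returns cosine distances; 1 - distance gives the similarity matrix
similarity = 1 - cdist(vectors, vectors, metric='cosine')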
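To get the labeled matrix the question asks for, the array returned by cdist can be wrapped in a DataFrame indexed by the names column. The following is only a sketch under stated assumptions: the three-word `glove` table, the whitespace `preprocess`, the helper `text_vector`, and the sample rows are hypothetical placeholders for the real embeddings and data.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical toy embeddings and data, used only to make the sketch runnable.
glove = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.8, 0.6, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}
a = pd.DataFrame({
    "names": ["name1", "name2", "name3"],
    "texts": ["cat dog", "dog", "car"],
})


def preprocess(text):
    # Hypothetical stand-in for the real preprocessing step.
    return text.lower().split()


def text_vector(text):
    # Average the vectors of the known words in the text.
    # (Every toy text here contains at least one known word.)
    return np.mean([glove[w] for w in preprocess(text) if w in glove], axis=0)


# Compute each text's vector once, then all pairwise similarities in one call.
vectors = np.vstack([text_vector(t) for t in a["texts"]])
similarity = 1 - cdist(vectors, vectors, metric="cosine")

# Label rows and columns with the names so the result reads like a correlation matrix.
result = pd.DataFrame(similarity, index=a["names"].tolist(), columns=a["names"].tolist())
print(result.round(2))
```

The speedup comes mostly from embedding each text once instead of once per pair, as the nested apply and iterrows versions do.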

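Keep in mind that this approach can need a lot of memory, especially with a large amount of text data, because the result is an n × n matrix for n texts.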
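If the full n × n matrix does not fit in memory, one option is to compute it a block of rows at a time and reduce each block before moving on. The sketch below assumes that only the most similar other text per row is needed, and it uses random vectors as a stand-in for averaged embeddings; `most_similar_blockwise` and `block_size` are hypothetical names chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist


def most_similar_blockwise(vectors, block_size=256):
    """For each row, return the index of the most similar other row."""
    n = len(vectors)
    best = np.empty(n, dtype=int)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Only a slab of at most (block_size x n) similarities exists at a time.
        sim = 1 - cdist(vectors[start:stop], vectors, metric="cosine")
        # Exclude each row's similarity with itself.
        sim[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        best[start:stop] = sim.argmax(axis=1)
    return best


rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 50))  # stand-in for averaged word embeddings
nearest = most_similar_blockwise(vectors, block_size=128)
```

The trade-off is that the complete matrix is never materialized, so this only helps when a per-row summary (best match, top few matches, a threshold count) is enough.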

【Discussion】: