Python - speed up cosine similarity with scipy

【Question title】: Python - speed up cosine similarity with scipy
【Posted】: 2019-03-27 03:56:18
【Question description】:

This question follows from one I asked earlier: Python - How to speed up cosine similarity with counting arrays

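With the suggested solution I am running into a serious complexity problem: my implementation takes a very long time to build the cosine similarity matrix. Below is the code I am using: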

import numpy as np
import pandas as pd
import networkx as nx
from scipy import spatial

def compute_other(user_1, user_2):
    uniq = list(set(user_1[0] + user_2[0]))

    duniq = {k:0 for k in uniq}    

    u1 = create_vector(duniq, list(user_1[0]))
    u2 = create_vector(duniq, list(user_2[0]))

    return 1 - spatial.distance.cosine(u1, u2)

# START
distances = spatial.distance.cdist(df[['ARTIST']], df[['ARTIST']], metric=compute_other)

idx_to_remove = np.triu_indices(len(distances))
distances[idx_to_remove] = 0

df_dist = pd.DataFrame(distances, index = df.index, columns = df.index)
edges = df_dist.stack().to_dict()
edges = {k: v for k, v in edges.items() if v > 0}

print('NET inference')
net = nx.Graph()
net.add_nodes_from(df.index)
net.add_edges_from(edges)

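The first thing I noticed is that I compute the full matrix and then throw half of it away, so it would be nice to compute only the half I need (that would be a 2x speedup).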

The structure of that df:

ARTIST
"(75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 15053)"
"(55852, 55852, 17727, 17727, 2182)"
"(11446, 11446, 11446, 11446, 11446, 11446, 11446, 11446)"
"(54795,)"
"(22873, 22873, 22873, 22873)"
"(5634, 5634)"
"(311, 18672)"
"(1740, 1740, 1740, 1740, 1746, 15048, 15048, 1740)"
"(1788, 1983, 1788, 1748, 723, 100744, 723, 226, 1583, 12188, 51325, 1748, 75401, 1171)"
"(59173, 59173)"
"(2673, 2673, 2673, 2673, 2673, 2673, 2673, 5634, 5634, 5634)"
"(2251, 4229, 14207, 1744, 16366, 1218)"
"(19703, 1171, 1171)"
"(12877,)"
"(1243, 8249, 2061, 1243, 13343, 9868, 574509, 892, 1080, 1243, 3868, 2061, 4655)"
"(1229,)"
"(3868, 60112, 11084)"
"(15869, 15869, 15869, 15869)"
"(4067, 4067, 4067, 4067, 4067, 4067)"
"(1171, 1171, 1171, 1171)"
"(1245, 1245, 1245, 1245, 1245, 1245, 1245, 1245, 1245, 1195, 1193, 1193, 1193, 1193, 1193, 1193)"
"(723, 723)"

This dataset is complete and can be used with the code I posted. Just read it as a normal csv with pandas and apply the function:

import ast
import pandas as pd

df = pd.read_csv('Stack.csv')
df['ARTIST'] = df['ARTIST'].apply(lambda x : ast.literal_eval(x))

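This code executes in almost 166. I run 8 processes in parallel on my 8-core processor, each computing the same function on a different dataset. Honestly, I don't know whether this is already the most optimized version, but removing half of the computation as I explained above would be really useful (going from 166 to 83).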

EDIT: below is the create_vector function:

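from collections import Counter

def create_vector(duniq, l):
    dx = duniq.copy()  # start from the shared keys, all counts at 0
    dx.update(Counter(l))  # overwrite with this user's counts
    return list(dx.values())  # counts in duniq's key order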
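To make it concrete what create_vector feeds into the cosine computation, here is a small worked sketch. The first user is a row from the sample data; `user_2` is a made-up second user, and sorting the keys is only there to make the printed order stable (the original code does not sort).

```python
from collections import Counter
import math

def create_vector(duniq, l):
    dx = duniq.copy()          # shared keys, all counts at 0
    dx.update(Counter(l))      # overwrite with this user's counts
    return list(dx.values())   # counts in duniq's key order

user_1 = (55852, 55852, 17727, 17727, 2182)   # row from the sample data
user_2 = (17727, 2182, 2182)                  # hypothetical second user

# Sorted only so the printout is stable; any fixed order works,
# as long as both vectors are built from the same dict.
uniq = sorted(set(user_1 + user_2))
duniq = {k: 0 for k in uniq}

u1 = create_vector(duniq, user_1)
u2 = create_vector(duniq, user_2)

# Cosine similarity written out by hand: dot product over product of norms.
dot = sum(a * b for a, b in zip(u1, u2))
norm = math.sqrt(sum(a * a for a in u1)) * math.sqrt(sum(b * b for b in u2))
similarity = dot / norm

print(uniq, u1, u2, round(similarity, 4))
```

Both vectors line up because they are built from the same `duniq`, so position k always refers to the same artist id.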

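【Comments】:

Why not use pdist instead of cdist? That would eliminate the redundant distance computations.
Thanks! Now it takes half the time. Is there anything else I can do to speed it up? @WarrenWeckesser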
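One possible way to go further than halving the work, sketched here under the assumption that a dense count matrix fits in memory: build the count vectors once for all users instead of once per pair, then let pdist apply its built-in 'cosine' metric. The names `artists`, `vocab` and `counts` are illustrative and not from the original code.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import pdist, squareform

# A few rows from the sample data, already parsed into tuples.
artists = [
    (55852, 55852, 17727, 17727, 2182),
    (5634, 5634),
    (2673, 2673, 2673, 2673, 2673, 2673, 2673, 5634, 5634, 5634),
    (19703, 1171, 1171),
    (1171, 1171, 1171, 1171),
]

# Map every artist id to a column once, instead of once per pair.
all_ids = sorted({a for row in artists for a in row})
vocab = {a: j for j, a in enumerate(all_ids)}

# One row of counts per user, one column per artist id.
counts = np.zeros((len(artists), len(vocab)))
for i, row in enumerate(artists):
    for artist, n in Counter(row).items():
        counts[i, vocab[artist]] = n

# pdist visits each unordered pair once and returns a condensed vector
# of cosine distances; 1 - distance gives the similarity.
condensed = 1 - pdist(counts, metric='cosine')

# Expand to a square matrix only if one is really needed.
# Note that squareform leaves the diagonal at 0.
similarity = squareform(condensed)
print(similarity.round(4))
```

Whether this actually beats the per-pair function on the full dataset would need to be measured; with a very large number of distinct artists a sparse matrix may be the better fit.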

Tags: python pandas scipy

【Solution 1】:

I tried to tinker with this, but I get a compilation error on these two lines: u1 = create_vector(duniq, list(user_1[0])) u2 = create_vector(duniq, list(user_2[0]))

Is create_vector() a def that you wrote but did not post?

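I suspect that using a mask on your df could improve performance: it would remove the overwrite you do with distances[idx_to_remove] = 0, and it should reduce the number of iterations in "edges = {k: v for k, v in edges.items() if v > 0}".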
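A minimal sketch of what such a mask could look like, assuming a square similarity matrix is already available. The 4x4 `distances` matrix and the `labels` list are made-up stand-ins for the real matrix and df.index.

```python
import numpy as np

# Hypothetical 4x4 similarity matrix standing in for `distances`.
distances = np.array([
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.2],
    [0.5, 0.0, 1.0, 0.0],
    [0.0, 0.2, 0.0, 1.0],
])
labels = ['a', 'b', 'c', 'd']   # stands in for df.index

# Positions strictly below the diagonal (k=-1 skips the diagonal itself),
# so nothing has to be overwritten with zeros first.
rows, cols = np.tril_indices(len(distances), k=-1)

# Boolean mask: keep only the pairs with a positive similarity.
keep = distances[rows, cols] > 0

# Only the surviving pairs are ever turned into Python objects.
edges = [(labels[i], labels[j]) for i, j in zip(rows[keep], cols[keep])]
print(edges)
```

The resulting list of pairs can be passed to net.add_edges_from in place of the filtered dict.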

If you can post the source of create_vector(), or the def itself, I would like to test a mask. This is an interesting problem.

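Hi @Guido. Sorry this took so long, but it was a hard one to crack! After trying a few different things (which took longer), I came up with the following to replace your create_vector() and compute_other() functions: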

def compute_other2(user_1, user_2):
    uniq = set(user_1[0] + user_2[0])  # unique items across user_1 and user_2
    u1 = [user_1[0].count(ui) for ui in uniq]  # user_1's count for each item
    u2 = [user_2[0].count(ui) for ui in uniq]  # user_2's count for each item
    return 1 - spatial.distance.cosine(u1, u2)

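I got a 20% performance gain, less than I expected, but still an improvement. Note: I am still running your code with "spatial.distance.cdist". I did see that you gained 50% by switching to "spatial.distance.pdist". I'm not sure how you used it, and (I suspect it's vector math) it is beyond my understanding. Maybe you can use this new compute_other() function with spatial.distance.pdist and gain a bit more.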
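For what it's worth, here is one way a per-pair function like this might be combined with pdist. pdist needs a numeric 2-D array, so this sketch passes row positions instead of the tuples and looks the tuples up inside the metric. The `artists` list, the `by_index` wrapper and the `positions` array are assumptions made for the illustration, not how the asker actually did it.

```python
import numpy as np
from scipy import spatial

# A few rows from the sample data, already parsed into tuples.
artists = [
    (5634, 5634),
    (2673, 2673, 2673, 2673, 2673, 2673, 2673, 5634, 5634, 5634),
    (19703, 1171, 1171),
    (1171, 1171, 1171, 1171),
]

def compute_other2(user_1, user_2):
    uniq = set(user_1[0] + user_2[0])  # unique items across both users
    u1 = [user_1[0].count(ui) for ui in uniq]
    u2 = [user_2[0].count(ui) for ui in uniq]
    return 1 - spatial.distance.cosine(u1, u2)

def by_index(i, j):
    # pdist hands over rows of `positions`: 1-element float arrays.
    # Wrap each tuple in a 1-tuple so compute_other2 can index [0].
    return compute_other2((artists[int(i[0])],), (artists[int(j[0])],))

# One row per user, holding nothing but that user's position.
positions = np.arange(len(artists), dtype=float).reshape(-1, 1)

# Condensed result: one value per unordered pair, in the order
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
condensed = spatial.distance.pdist(positions, metric=by_index)
print(condensed.round(4))
```

A Python callable is still invoked once per pair, so this only removes the redundant half of the work; it does not make each call any cheaper.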

P.S. If you try this, please verify the results. I checked my code against your original code and it looks correct to me.
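A check along these lines could look roughly like the sketch below: it runs both functions on every pair of a few sample rows and collects any pair where the results differ. The `rows` list and the `mismatches` name are illustrative.

```python
from collections import Counter
from itertools import combinations

import numpy as np
from scipy import spatial

def create_vector(duniq, l):
    dx = duniq.copy()
    dx.update(Counter(l))
    return list(dx.values())

def compute_other(user_1, user_2):
    # Original version from the question.
    uniq = list(set(user_1[0] + user_2[0]))
    duniq = {k: 0 for k in uniq}
    u1 = create_vector(duniq, list(user_1[0]))
    u2 = create_vector(duniq, list(user_2[0]))
    return 1 - spatial.distance.cosine(u1, u2)

def compute_other2(user_1, user_2):
    # Replacement version from this answer.
    uniq = set(user_1[0] + user_2[0])
    u1 = [user_1[0].count(ui) for ui in uniq]
    u2 = [user_2[0].count(ui) for ui in uniq]
    return 1 - spatial.distance.cosine(u1, u2)

# A few rows from the sample data, already parsed into tuples.
rows = [
    (55852, 55852, 17727, 17727, 2182),
    (5634, 5634),
    (2673, 2673, 2673, 2673, 2673, 2673, 2673, 5634, 5634, 5634),
    (19703, 1171, 1171),
    (1171, 1171, 1171, 1171),
]

# Each tuple is wrapped in a 1-tuple because both functions index [0].
mismatches = [
    (a, b)
    for a, b in combinations(rows, 2)
    if not np.isclose(compute_other((a,), (b,)), compute_other2((a,), (b,)))
]
print(mismatches)  # an empty list means the two versions agree on these rows
```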

【Discussion】:

I updated the question with the required function, thanks. @Ethan