【Question title】: How to speed up cosine similarity between a numpy array and a very very large matrix?
【Posted】: 2017-12-05 21:04:00
【Question description】:

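I need to compute cosine similarities between a numpy array of shape (1, 300) and a matrix of shape (5000000, 300). I have tried several different flavors of code, and now I am wondering whether there is a way to substantially reduce the run time: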

Version 1: I split my big matrix into 5 smaller matrices of 1 million rows each:

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from scipy import spatial

def cos_matrix_multiplication(vector, matrix_1):
    # cdist with 'cosine' returns cosine distances (1 - cosine similarity),
    # one row per row of matrix_1
    v = vector.reshape(1, -1)
    scores1 = spatial.distance.cdist(matrix_1, v, 'cosine')
    # Note: scores1[:1] keeps only the first row of the result
    return scores1[:1]

pool = ThreadPoolExecutor(8)  # not used below

# The five 1M-row chunks of the large matrix
URLS = [mat_small1, mat_small2, mat_small3, mat_small4, mat_small5]

neighbors = []
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    # Submit one task per chunk and remember which chunk each future belongs to
    future_to_url = {executor.submit(cos_matrix_multiplication, vec, mat_col): mat_col for mat_col in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        data = future.result()
        neighbors.append(data)

Run time: 2.48 seconds

Version 2: using Numba jit, inspired by an SO answer:

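import numba
import numpy as np

# Types are left for Numba to infer: the original 'void(f4, f4)' signature
# describes two scalars and no return value, which does not match this function.
@numba.jit(nopython=True, nogil=True)
def cosine_sim(A, B):
    # A: (n, d) matrix, B: (1, d) query vector
    scores = np.zeros(A.shape[0])
    m = B.shape[1]
    for i in range(A.shape[0]):
        v = A[i]
        udotv = 0.0
        u_norm = 0.0
        v_norm = 0.0
        for j in range(m):
            udotv += B[0][j] * v[j]
            u_norm += B[0][j] * B[0][j]
            v_norm += v[j] * v[j]

        ratio = udotv / ((u_norm * v_norm) ** 0.5)
        scores[i] = ratio
    return scores

cosine_sim(matrix, vec)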

Run time: 2.34 seconds

Version 3: using Cuda jit (I could not get it to work reproducibly every time)

import math
from numba import cuda

@cuda.jit
def cosine_sim(A, B, C):
    # Note: there is no thread indexing here, so every launched thread
    # runs this full loop over all rows of A.
    for i in range(A.shape[0]):
        v = A[i]
        m = B.shape[1]
        udotv = 0.0
        u_norm = 0.0
        v_norm = 0.0
        for j in range(m):
            udotv += B[0][j] * v[j]
            u_norm += B[0][j] * B[0][j]
            v_norm += v[j] * v[j]

        u_norm = math.sqrt(u_norm)
        v_norm = math.sqrt(v_norm)

        if (u_norm == 0) or (v_norm == 0):
            ratio = 1.0
        else:
            ratio = udotv / (u_norm * v_norm)
        C[i, 0] = ratio  # C has shape (n, 1), so the only valid column is 0


matrix = mat_small1

# Copy the inputs to the GPU and allocate the output there
A_global_mem = cuda.to_device(matrix)
B_global_mem = cuda.to_device(vec)

C_global_mem = cuda.device_array((matrix.shape[0], 1))

threadsperblock = (16, 16)
blockspergrid_x = int(math.ceil(A_global_mem.shape[0] / threadsperblock[0]))
blockspergrid_y = int(math.ceil(B_global_mem.shape[1] / threadsperblock[1]))
blockspergrid = (blockspergrid_x, blockspergrid_y)

cosine_sim[blockspergrid, threadsperblock](A_global_mem, B_global_mem, C_global_mem)

C = C_global_mem.copy_to_host()

Result: CudaAPIError: [702] Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_TIMEOUT

The matrix is dense, my GPU has 8 GB of RAM, and the total size of the matrix is about 4.7 GB. Can a GPU speed this up at all?

【Comments】:

  • The CUDA kernel you wrote is completely serial
  • You could try np.apply_along_axis, i.e. np.apply_along_axis(lambda v: spatial.distance.cosine(matrix_1, v), axis=1, arr=matrix_2)
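
np.apply_along_axis still calls a Python function once per row, so it is unlikely to help at 5 million rows. One alternative worth timing is a single matrix-vector product over the whole matrix. The following is a minimal sketch, assuming `matrix` has shape (n, 300) and `vec` has shape (1, 300) as in the question. The function name `cosine_similarities`, the small random test data, and the choice to score zero-norm rows as 0.0 are illustrative assumptions, not part of the original post.

```python
import numpy as np

def cosine_similarities(matrix, vec):
    """Cosine similarity between each row of `matrix` (n, d) and `vec` (1, d) or (d,)."""
    v = np.asarray(vec).ravel()
    dots = matrix @ v                                             # (n,) dot products
    row_norms = np.sqrt(np.einsum("ij,ij->i", matrix, matrix))    # (n,) row norms
    denom = row_norms * np.linalg.norm(v)
    # Assumption: rows (or a query) with zero norm get similarity 0.0
    out = np.zeros(dots.shape, dtype=np.float64)
    np.divide(dots, denom, out=out, where=denom > 0)
    return out

# Small stand-in data; the question's matrix is (5000000, 300)
rng = np.random.default_rng(0)
matrix = rng.standard_normal((1000, 300)).astype(np.float32)
vec = rng.standard_normal((1, 300)).astype(np.float32)

sims = cosine_similarities(matrix, vec)
print(sims.shape)
```

Note that scipy's cdist with 'cosine' returns the cosine distance, which is 1 minus these values. If many query vectors are scored against the same matrix, the row norms can be computed once and reused.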

Tags: python cuda gpu numba cosine-similarity


【Solution 1】:

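Try replacing ThreadPoolExecutor with ProcessPoolExecutor (you have already imported it). The former is meant for asynchronous calls rather than CPU-bound tasks, although the documentation does not state this directly.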
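
A rough sketch of what that swap might look like, assuming the matrix is already split into chunks as in Version 1. The names `chunk_cosine` and `parallel_cosine` are hypothetical, and the per-chunk scoring uses plain NumPy rather than the question's cdist call.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_cosine(vec, chunk):
    """Cosine similarity of each row of `chunk` against `vec`; runs in a worker process."""
    v = np.asarray(vec).ravel()
    norms = np.linalg.norm(chunk, axis=1) * np.linalg.norm(v)
    return (chunk @ v) / norms  # assumes no all-zero rows

def parallel_cosine(vec, chunks, max_workers=None):
    """Score each chunk in its own process and concatenate the results in chunk order."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order, unlike as_completed
        parts = list(executor.map(chunk_cosine, [vec] * len(chunks), chunks))
    return np.concatenate(parts)
```

It would be called as `parallel_cosine(vec, [mat_small1, mat_small2, mat_small3, mat_small4, mat_small5])`, placed under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Each chunk is pickled and sent to a worker, so whether this is faster than the threaded version depends on how that transfer cost compares with the compute time.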
