Created Shared Memory Code with Python Cuda: answers

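【Question title】: Created Shared Memory Code with Python Cuda
【Posted】: 2020-03-30 17:36:37
【Question description】:

I'm trying to get some code running to explore shared memory features for fast matrix multiplication. But every time I try this, I run into errors that I cannot fathom.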

import numpy as np
from numba import cuda, types
m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

block_size = (m,n)
grid_size = (int(m/n),int(m/n))


@cuda.jit
def mm(a, b, c):
    column, row = cuda.grid(2)
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)


    a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[row][i] * b_cache[i][column]
    c[row][column] = sum

and the test

mm[grid_size, block_size](d_a, d_b, d_c)
solution = a@b
output = d_c.copy_to_host()

consistently results in the following error:

CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR

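After chatting with the author of one of the answers, I have updated the function, but I still can't make it work. To compute the sum for each element of the output c, we need to loop over the columns of A and the rows of B, using i as the index, so there are n*n products. I think the i index is correct in the sum, but I can't seem to get the correct row and column indices for a and b in the sum expression.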

import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)


    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]

    cuda.syncthreads()


    for i in range(a.shape[1]):

        sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]

    c[row][column] = sum

【Question comments】:

Surely that code is missing a jit decorator?
@talonmies Fixed.

Tags: cuda numba

【Solution 1】:

Your block size is invalid. CUDA devices have a limit of 1024 threads per block. When I run your code, I see this:

/opt/miniconda3/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
    327                     _logger.critical(msg, _getpid(), self.pid)
    328                     raise CudaDriverError("CUDA initialized before forking")
--> 329             raise CudaAPIError(retcode, msg)
    330 
    331     def get_device(self, devnum=0):

CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
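
As a sketch of how the launch configuration could be validated on the host before calling the kernel, the helper below checks a block shape against the 1024-threads-per-block limit mentioned above and derives a grid that covers an output of a given shape. `launch_config` is a hypothetical name, not part of Numba, and the 16 x 16 block in the second call is only an assumed example.

```python
import math

MAX_THREADS_PER_BLOCK = 1024  # limit cited in the answer above


def launch_config(out_shape, block_size):
    """Return (grid_size, block_size) covering `out_shape`.

    Hypothetical helper for illustration; not part of Numba.
    Raises ValueError if the block holds more threads than the limit.
    """
    threads = math.prod(block_size)
    if threads > MAX_THREADS_PER_BLOCK:
        raise ValueError(
            f"block {block_size} has {threads} threads, limit is {MAX_THREADS_PER_BLOCK}"
        )
    # Ceiling division so that partial tiles at the edges are still covered
    grid_size = tuple(math.ceil(n / b) for n, b in zip(out_shape, block_size))
    return grid_size, block_size


# The question's block of (128, 32) holds 4096 threads and is rejected
try:
    launch_config((128, 128), (128, 32))
except ValueError as err:
    print(err)

# An assumed 16 x 16 block (256 threads) over a 128 x 128 output
print(launch_config((128, 128), (16, 16)))
```

The returned pair can be passed as `kernel[grid_size, block_size](...)`. A kernel launched this way still needs its own bounds check whenever the output shape is not a multiple of the block shape.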

When I fix that, I see this:

$ cuda-memcheck python somethingsometing.py

========= CUDA-MEMCHECK
========= Invalid __shared__ read of size 4
=========     at 0x000008b0 in cudapy::__main__::mm$241(Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>)
=========     by thread (15,11,0) in block (3,2,0)
=========     Address 0x00000ec0 is out of bounds

The reason is pretty obvious:

for i in range(a.shape[1]):
    sum += a_cache[row][i] * b_cache[i][column]

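row and column are dimensions in the execution grid, not in the local shared memory tile, and similarly i is bounded by the shape of a, not by the shape of a_cache. (Note also that you seem to have lapsed into C-style 2D array indexing syntax about halfway through the code, which is a potential bug if you don't understand the difference between the two in Python.)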
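
To make the distinction concrete, the sketch below reproduces in plain Python the arithmetic behind `cuda.grid(2)` for the thread named in the cuda-memcheck report, thread (15,11) in block (3,2). The 16 x 16 block shape is an assumption (the report does not say which block size was used for the fixed launch), and `global_index` is a hypothetical helper.

```python
# Assumed block shape for illustration; the memcheck report does not state it.
BLOCK_DIM = (16, 16)


def global_index(block_idx, thread_idx, block_dim=BLOCK_DIM):
    """Per-axis blockIdx * blockDim + threadIdx, which is what cuda.grid(2) returns."""
    return tuple(b * d + t for b, d, t in zip(block_idx, block_dim, thread_idx))


# Thread (15, 11) in block (3, 2), unpacked in the same order as the question's kernel
column, row = global_index((3, 2), (15, 11))
print(column, row)  # 63 43

# A shared tile of shape BLOCK_DIM has only 16 rows, so a_cache[row] is out of bounds,
# whereas the thread-local index threadIdx.y (11 here) is always inside the tile.
print(row < BLOCK_DIM[1])  # False
print(11 < BLOCK_DIM[1])   # True
```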

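To fix it, you will have to change the indexing and then implement the rest of the multiplication (i.e. you must iteratively load the whole row and column slices through the local shared tiles in order to compute the full dot product for each row/column pair that a block will process).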
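
The iterative tile loading described above can be emulated on the host to check the indexing before moving it into a kernel. This is a minimal pure-Python sketch, not the answerer's code: `tiled_matmul` is a hypothetical name, the matrices are nested lists, and every dimension is assumed to be a multiple of the tile width.

```python
def tiled_matmul(a, b, tile):
    """Host-side emulation of a shared-memory tiled matrix multiply.

    Assumes every dimension is a multiple of `tile`.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for row0 in range(0, m, tile):        # each (row0, col0) pair plays the role of one block
        for col0 in range(0, n, tile):
            for k0 in range(0, k, tile):  # walk the tiles along the shared dimension
                # "Load" one tile of a and one tile of b, as the threads of a block would
                a_cache = [r[k0:k0 + tile] for r in a[row0:row0 + tile]]
                b_cache = [r[col0:col0 + tile] for r in b[k0:k0 + tile]]
                # Each (ty, tx) pair stands for one thread; cache indices are tile-local
                for ty in range(tile):
                    for tx in range(tile):
                        for j in range(tile):
                            c[row0 + ty][col0 + tx] += a_cache[ty][j] * b_cache[j][tx]
    return c


# Small check against a direct triple loop
a = [[i * 4 + j for j in range(4)] for i in range(6)]  # 6 x 4
b = [[i * 6 + j for j in range(6)] for i in range(4)]  # 4 x 6
expected = [[sum(a[i][p] * b[p][j] for p in range(4)) for j in range(6)] for i in range(6)]
print(tiled_matmul(a, b, tile=2) == expected)
```

The two points the answer makes are visible here: the caches are indexed only with tile-local indices, and the inner loop runs over the tile width rather than over the full shape of a.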

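Note also that:

The dimensions you have selected for c are wrong (they should be m x m).
The grid size you run the kernel on is also wrong, because the dimensions of c are wrong, so your code could never compute the whole matrix.
Even after fixing all of this, the results of the multiplication are likely to be incorrect at anything other than trivial sizes because of integer overflow.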
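
As a rough way to see where the overflow mentioned in the last point starts, the sketch below computes the largest entry of a @ b with Python's arbitrary-precision integers and compares it with the int32 maximum. `largest_product_entry` is a hypothetical helper that relies on the arange-based inputs from the question, and the 512 x 128 case is an assumed larger size for comparison.

```python
INT32_MAX = 2**31 - 1


def largest_product_entry(m, n):
    """Largest entry of a @ b for a = arange(m*n).reshape(m, n), b = arange(m*n).reshape(n, m).

    All values are non-negative and grow with the index, so the maximum is the
    last row of a times the last column of b. Python ints do not overflow.
    """
    last_row_a = [(m - 1) * n + k for k in range(n)]
    last_col_b = [k * m + (m - 1) for k in range(n)]
    return sum(x * y for x, y in zip(last_row_a, last_col_b))


# True means every entry of the product fits in int32
for m, n in [(128, 32), (512, 128)]:
    print(m, n, largest_product_entry(m, n) <= INT32_MAX)
```

With these inputs the 128 x 32 case from the question still fits in int32, while the 512 x 128 case does not, so a wider accumulator and output type (for example int64) would be needed at that size.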

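【Comments】:

Thanks. I'm still struggling with this. The first point was a typo (it is now m x m). I still can't get it to work. I can post the updated code in the original question.
Please read my answer again: "similarly i is bounded by the shape of a, not by the shape of a_cache". You are still making the same mistake in the updated code.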

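【Solution 2】:

@disruptive: Hi, did you find a solution to your problem? I had the same problem as you, but I solved it by restarting the kernel of the Jupyter notebook.

My code is slightly different from yours: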

# `N` (the tile width) and `block_size = (N, N)` are assumed to be defined at module level
@cuda.jit
def mm_shared(a, b, c):
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    # Equivalent to `row, col = cuda.grid(2)`
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

    # Walk over the tiles along the shared dimension
    # (assumes a.shape[1] is a multiple of N)
    for i in range(a.shape[1] // N):
        a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, cuda.threadIdx.y + i * N]
        b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * N, col]

        # Wait until the whole tile is loaded
        cuda.syncthreads()

        for j in range(N):
            sum += a_cache[cuda.threadIdx.x, j] * b_cache[j, cuda.threadIdx.y]

        # Wait until all threads finish computing
        cuda.syncthreads()

    c[row, col] = sum

Let me know if you have any updates.

【Comments】:

【Solution 3】:

This is the correct solution:

import numpy as np
from numba import cuda, types

# `N` (the tile width) and `block_size = (N, N)` are assumed to be defined at module level
@cuda.jit
def mm_shared(a, b, c):
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    # TODO: use each thread to populate one element each a_cache and b_cache
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    TPB = int(N)

    # Integer division, since range() needs an int (assumes a.shape[1] is a multiple of TPB)
    for i in range(a.shape[1] // TPB):
        a_cache[tx, ty] = a[x, ty + i * TPB]
        b_cache[tx, ty] = b[tx + i * TPB, y]

        # Wait until the whole tile is loaded
        cuda.syncthreads()
        for j in range(TPB):
            # TODO: calculate the `sum` value correctly using values from the cache
            sum += a_cache[tx, j] * b_cache[j, ty]
        # Wait until every thread is done with this tile before it is overwritten
        cuda.syncthreads()

    c[x, y] = sum

【Comments】: