【发布时间】:2020-03-30 17:36:37
【问题描述】:
我正在努力让一些代码运行以探索共享内存功能以获得快速矩阵乘法。但是每次我尝试这个时,我似乎都会遇到我无法理解的错误。
import numpy as np
from numba import cuda, types
m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)
block_size = (m,n)
grid_size = (int(m/n),int(m/n))
@cuda.jit
def mm(a, b, c):
column, row = cuda.grid(2)
sum = 0
# `a_cache` and `b_cache` are already correctly defined
a_cache = cuda.shared.array(block_size, types.int32)
b_cache = cuda.shared.array(block_size, types.int32)
a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]
cuda.syncthreads()
for i in range(a.shape[1]):
sum += a_cache[row][i] * b_cache[i][column]
c[row][column] = sum
和测试
mm[grid_size, block_size](d_a, d_b, d_c)
solution = a@b
output = d_c.copy_to_host()
不断导致以下错误:
CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
在与一个答案的提供者聊天后,我更新了功能。但仍然无法完成这项工作。因此,为了计算输出 c 中每个元素的总和,我们需要循环 A 的列和 B 的行,使用 i 作为索引。因此,我们有 n*n 个产品。我认为我在总和中是正确的,但我似乎无法在总和表达式中获得 a 和 b 的行和列的正确索引。
import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
column, row = cuda.grid(2)
sum = 0
# `a_cache` and `b_cache` are already correctly defined
a_cache = cuda.shared.array(block_size, types.int32)
b_cache = cuda.shared.array(block_size, types.int32)
a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]
cuda.syncthreads()
for i in range(a.shape[1]):
sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]
c[row][column] = sum
【问题讨论】:
-
该代码中肯定缺少 jit 装饰器吗?
-
@talonmies 已修复