【发布时间】:2020-10-18 20:05:20
【问题描述】:
我是 Cuda 的新手,我正在尝试使用 Cuda 将我现有的项目迁移到 GPU。 我的代码基于复杂的矩阵和复杂的缓冲区。
第一步,我尝试将嵌套的 For 循环代码移动到 Cuda(其余部分类似):
typedef thrust::complex<double> smp_t;
uint8_t *binbuffer = (uint8_t*) malloc(8 * bufsize * sizeof(uint8_t));
smp_t *sgbuf = (smp_t*) malloc(8 * bufsize * sizeof(smp_t));
smp_t *cnbuf = (smp_t*) malloc(8 * bufsize * sizeof(smp_t));
// Create matrix.
thrust::complex<double> i_unit(0.0, 1.0);
thrust::host_vector<thrust::host_vector<smp_t>> tw(decfactor);
// Fill the Matrix
for (size_t row = 0; row < 8; row++) {
for (size_t col = 0; col < 8; col++) {
std::complex<double> tmp =
exp(-i_unit * 2.0*M_PI * ((double) col*row) / (double)8);
tw[row].push_back(tmp);
}
}
/* The Code To Move to the GPU processing */
for (unsigned int i = 0; i < bufsize; i++) {
for (size_t ch = 0; ch < 8; ch++)
for (size_t k = 0; k < 8; k++)
cnbuf[ch*bufsize + i] += sgbuf[k*bufsize+i] * tw[ch].at(k);
}
这是 .cu 文件中的代码,它将替换当前嵌套的 for 循环:
__global__ void kernel_func(cuDoubleComplex *cnbuf, cuDoubleComplex *sgbuf, smp_t *tw, size_t block_size) {
unsigned int ch = threadIdx.x;
unsigned int k = blockIdx.x;
for (int x = 0; x < block_size; ++x) {
unsigned int sig_index = k*block_size+x;
unsigned int tw_index = ch*k;
unsigned int cn_index = ch*block_size+x;
cuDoubleComplex temp = cuCmul(sgbuf[sig_index], make_cuDoubleComplex(tw[tw_index].real(), tw[tw_index].imag()));
cnbuf[cn_index] = cuCadd(temp, cnbuf[cn_index]);
}
}
void kernel_wrap(
smp_t *cnbuf,
smp_t *sgbuf,
thrust::host_vector<thrust::host_vector<smp_t>>tw,
size_t buffer_size) {
smp_t *d_sgbuf;
smp_t *d_cnbuf;
thrust::device_vector<smp_t> d_tw(8*8);
thrust::copy(&tw[0][0], &tw[7][7], d_tw.begin());
cudaMalloc((void **)&d_sgbuf, buffer_size);
cudaMalloc((void **)&d_cnbuf, buffer_size);
cudaMemcpy(d_sgbuf, sgbuf, buffer_size, cudaMemcpyDeviceToHost);
cudaMemcpy(d_cnbuf, cnbuf, buffer_size, cudaMemcpyDeviceToHost);
thrust::raw_pointer_cast(d_tw.data());
kernel_func<<<8, 8>>>(
reinterpret_cast<cuDoubleComplex*>(d_cnbuf),
reinterpret_cast<cuDoubleComplex*>(d_sgbuf),
thrust::raw_pointer_cast(d_tw.data()),
buffer_size
);
cudaError_t varCudaError1 = cudaGetLastError();
if (varCudaError1 != cudaSuccess)
{
std::cout << "Failed to launch subDelimiterExamine kernel (error code: " << cudaGetErrorString(varCudaError1) << ")!" << std::endl;
exit(EXIT_FAILURE);
}
cudaMemcpy(sgbuf, d_sgbuf, buffer_size, cudaMemcpyHostToDevice);
cudaMemcpy(cnbuf, d_cnbuf, buffer_size, cudaMemcpyHostToDevice);
}
当我运行代码时,我得到了错误:
Failed to launch subDelimiterExamine kernel (error code: invalid argument)!
我认为引起麻烦的论点是“d_tw”。 所以,我的问题是:
- 将 <:host_vector smp_t>> 转换为 <:device_vector smp_t>>(从 2d 矩阵到一个扁平 arr)我做错了什么?
- 在 CUDA 中处理二维复数是否有更好的方法?
- 关于 Cuda 中复杂数组的文档非常差,我在哪里可以阅读大量关于 Cuda 复杂矩阵的工作?
谢谢!!!!
【问题讨论】:
-
推力容器仅适用于 POD 类型。不要尝试使用向量的向量。它不会工作
标签: c++ matrix cuda complex-numbers thrust