【发布时间】:2013-10-14 23:02:44
【问题描述】:
这里我尝试使用一些伪代码自我解释CUDA启动参数模型(或执行配置模型),但我不知道是否有一些大错误,所以希望有人帮助复习一下,给我一些建议。谢谢高级。
这里是:
/*
normally, we write kernel function like this.
note, __global__ means this function will be called from host codes,
and executed on device. and a __global__ function could only return void.
if there's any parameter passed into __global__ function, it should be stored
in shared memory on device. so, kernel function is so different from the *normal*
C/C++ functions. if I was the CUDA authore, I should make the kernel function more
different from a normal C function.
*/
__global__ void
kernel(float *arr_on_device, int n) {
int idx = blockIdx.x * blockDIm.x + threadIdx.x;
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
/*
after this definition, we could call this kernel function in our normal C/C++ codes !!
do you feel something wired ? un-consistant ?
normally, when I write C codes, I will think a lot about the execution process down to
the metal in my mind, and this one...it's like some fragile codes. break the sequential
thinking process in my mind.
in order to make things normal, I found a way to explain: I expand the *__global__ * function
to some pseudo codes:
*/
#define __foreach(var, start, end) for (var = start, var < end; ++var)
__device__ int
__indexing() {
const int blockId = blockIdx.x * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
return
blockId * (blockDim.x * blockDim.y * blockDim.z) +
threadIdx.z * (blockDim.x * blockDim.y) +
threadIdx.x;
}
global_config =:
{
/*
global configuration.
note the default values are all 1, so in the kernel codes,
we could just ignore those dimensions.
*/
gridDim.x = gridDim.y = gridDim.z = 1;
blockDim.x = blockDim.y = blockDim.z = 1;
};
kernel =:
{
/*
I thought CUDA did some bad evil-detail-covering things here.
it's said that CUDA C is an extension of C, but in my mind,
CUDA C is more like C++, and the *<<<>>>* part is too tricky.
for example:
kernel<<<10, 32>>>(); means kernel will execute in 10 blocks each have 32 threads.
dim3 dimG(10, 1, 1);
dim3 dimB(32, 1, 1);
kernel<<<dimG, dimB>>>(); this is exactly the same thing with above.
it's not C style, and C++ style ? at first, I thought this could be done by
C++'s constructor stuff, but I checked structure *dim3*, there's no proper
constructor for this. this just brroke the semantics of both C and C++. I thought
force user to use *kernel<<<dim3, dim3>>>* would be better. So I'd like to keep
this rule in my future codes.
*/
gridDim = dimG;
blockDim = dimB;
__foreach(blockIdx.z, 0, gridDim.z)
__foreach(blockIdx.y, 0, gridDim.y)
__foreach(blockIdx.x, 0, gridDim.x)
__foreach(threadIdx.z, 0, blockDim.z)
__foreach(threadIdx.y, 0, blockDim.y)
__foreach(threadIdx.x, 0, blockDim.x)
{
const int idx = __indexing();
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
};
/*
so, for me, gridDim & blockDim is like some boundaries.
e.g. gridDim.x is the upper bound of blockIdx.x, this is not that obvious for people like me.
*/
/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
unsigned int x, y, z;
#if defined(__cplusplus)
__host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
__host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
__host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};
typedef __device_builtin__ struct dim3 dim3;
【问题讨论】:
-
有很多课程可供选择,包括介绍性课程,您可以随时观看,只需 1 小时,here。试试
GPU Computing using CUDA C系列。 -
你读过我下面的解释了吗?
-
您似乎对网格和块感到非常困惑,更不用说非常重要的底层架构了。我在 udacity 注册了免费的在线 CUDA 课程,并在一周内获得了有效(如果不是很高级)的编码。检查一下,因为 GPU 编程似乎需要坚实的基础。
-
@Boyko 我查过你建议的课程,特别是解释内核启动配置的部分udacity.com/course/viewer#!/c-cs344/l-55120467/m-67074291(从这个剪辑开始,然后是3个剪辑),老师仍然没有说清楚GPU如何将任务分派给线程。看看那个内核函数,编译器是如何扩展代码的?很多人只是被告知要记住规则,但是规则是如何在代码中实施的呢?所以现在,我正在阅读驱动程序 API 和 OpenCL 文档,只是想了解幕后发生的事情。
-
制作了更高级别的语言,因此您不必担心代码如何将其转化为 CPU 上的机器指令。而且每一个新一代的 CPU 都会带来一些新的东西,即使指令集是一样的,但在不同的架构上实际发生的事情并不一定是一样的。