【Question title】: CUDA sum to the right
【Posted】: 2021-06-08 14:38:40
【Question description】:

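I am trying to implement a sum reduction in CUDA, but I want the result to accumulate on the right instead of the left. I wrote the code below, but I'm not sure why it doesn't work.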

__global__ void reduce_kernel(
    float *input,
    float *partialSums,
    unsigned int N) 
{
    unsigned int segment = blockIdx.x * blockDim.x * 2;
    unsigned int i = segment + threadIdx.x;
    __shared__ float input_s[BLOCK_DIM];

    input_s[threadIdx.x] = input[i] + input[i + BLOCK_DIM];
    int count = 2;
    __syncthreads();

    for (unsigned int stride = BLOCK_DIM / 2; 
         stride < BLOCK_DIM;
         stride = stride + (BLOCK_DIM / count)) 
    {
        if (threadIdx.x >= stride) {
            count = count * 2;
            input_s[threadIdx.x] += input_s[threadIdx.x - stride];
            printf("%d  ", stride);
            __syncthreads();
            if (stride == BLOCK_DIM - 1) {
                break;
            }
        }
        __syncthreads();
    }

    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = input_s[threadIdx.x];
    }
}

Any ideas what I am doing wrong?

【Question discussion】:

    Tags: cuda reduction prefix-sum


    【Solution 1】:

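    This should do exactly what you want, as long as the number of input elements is a power of two (and at least 2 * BLOCK_DIM, since the kernel does no bounds checking). The partial sum of each block ends up on the right. For this kind of algorithm the stride must either grow from 1 to BLOCK_DIM / 2 (which produces more warp divergence) or shrink from BLOCK_DIM / 2 to 1. Either way, it should be multiplied or divided by 2 at each step.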

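    __global__ void reduce_kernel(
        float *input,
        float *partialSums,
        unsigned int N)   // N is not used: no bounds checks are performed
    {
        // Each block handles a segment of 2 * blockDim.x input elements.
        // The kernel assumes blockDim.x == BLOCK_DIM.
        unsigned int segment = blockIdx.x * blockDim.x * 2;
        unsigned int i = segment + threadIdx.x;
        __shared__ float input_s[BLOCK_DIM];
    
        // First addition happens while loading into shared memory.
        input_s[threadIdx.x] = input[i] + input[i + BLOCK_DIM];
        __syncthreads();
    
        // Halve the stride each step; the active threads are the
        // rightmost `stride` threads of the block.
        for (unsigned int stride = BLOCK_DIM / 2; 
             stride > 0;
             stride /= 2) 
        {
            if (threadIdx.x >= BLOCK_DIM - stride) {
                input_s[threadIdx.x] += input_s[threadIdx.x - stride];
            }
            // Barrier outside the conditional: every thread reaches it.
            __syncthreads();
        }
    
        // The rightmost thread holds the block's sum.
        if (threadIdx.x == BLOCK_DIM - 1) {
            partialSums[blockIdx.x] = input_s[threadIdx.x];
        }
    }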
    

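    The __syncthreads(); inside the conditional was another error, because all threads of the block must take part in the synchronization. Otherwise the behavior is undefined.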
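
    The answer mentions that the stride may also grow from 1 to BLOCK_DIM / 2. As a rough illustration only, here is a minimal sketch of that growing-stride variant, still collecting the sum in the rightmost element and keeping __syncthreads() outside the conditional. The names reduce_right_growing and check_reduce_right, the BLOCK_DIM value of 256, the guarded loads, and the test data are all assumptions made for this sketch and are not part of the original post.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_DIM 256  // assumed value; must be a power of two and equal blockDim.x

// Hypothetical growing-stride variant: the block sum ends up in input_s[BLOCK_DIM - 1].
__global__ void reduce_right_growing(const float *input, float *partialSums, unsigned int N)
{
    __shared__ float input_s[BLOCK_DIM];
    unsigned int segment = blockIdx.x * BLOCK_DIM * 2;
    unsigned int i = segment + threadIdx.x;

    // Guarded loads (an addition in this sketch) so N need not be a multiple of 2 * BLOCK_DIM.
    float a = (i < N) ? input[i] : 0.0f;
    float b = (i + BLOCK_DIM < N) ? input[i + BLOCK_DIM] : 0.0f;
    input_s[threadIdx.x] = a + b;
    __syncthreads();

    for (unsigned int stride = 1; stride <= BLOCK_DIM / 2; stride *= 2) {
        // Distance of this thread from the rightmost slot.
        unsigned int fromRight = BLOCK_DIM - 1 - threadIdx.x;
        if (fromRight % (2 * stride) == 0) {
            input_s[threadIdx.x] += input_s[threadIdx.x - stride];
        }
        __syncthreads();  // outside the if: every thread of the block reaches it
    }

    if (threadIdx.x == BLOCK_DIM - 1) {
        partialSums[blockIdx.x] = input_s[threadIdx.x];
    }
}

// Host helper: runs the kernel and compares each partial sum with a CPU sum.
// Inputs are small integers, so the float sums are exact and == is safe here.
bool check_reduce_right(unsigned int N)
{
    if (N == 0) return false;
    unsigned int numBlocks = (N + 2 * BLOCK_DIM - 1) / (2 * BLOCK_DIM);
    std::vector<float> h_in(N);
    for (unsigned int k = 0; k < N; ++k) h_in[k] = static_cast<float>(k % 7);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, N * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, numBlocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    reduce_right_growing<<<numBlocks, BLOCK_DIM>>>(d_in, d_out, N);

    std::vector<float> h_out(numBlocks);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, numBlocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (unsigned int blk = 0; blk < numBlocks; ++blk) {
        float expected = 0.0f;
        unsigned int begin = blk * 2 * BLOCK_DIM;
        for (unsigned int k = begin; k < begin + 2 * BLOCK_DIM && k < N; ++k) {
            expected += h_in[k];
        }
        if (h_out[blk] != expected) return false;
    }
    return true;
}

int main()
{
    bool ok = check_reduce_right(2000);  // deliberately not a multiple of 2 * BLOCK_DIM
    printf("%s\n", ok ? "partial sums match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

    In this variant the active threads are spread across the whole block in the early steps, which is the extra warp divergence the answer refers to; the shrinking-stride kernel above keeps the active threads contiguous at the right end.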

    【Discussion】:
