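Access/synchronization to local memory: answers

【Question title】: Access/synchronization to local memory
【Posted】: 2017-04-23 00:08:21
【Question】:

I am new to GPGPU programming. I am trying to implement an algorithm that needs a lot of synchronization, so it uses only one work-group (the global and local sizes have the same value).

I have the following problem: my program works correctly until the problem size exceeds 32.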

__kernel void assort(
__global float *array,
__local float *currentOutput,
__local float *stimulations,
__local int *noOfValuesAdded,
__local float *addedValue,
__local float *positionToInsert,
__local int *activatedIdx,
__local float *range,
int size,
__global float *stimulationsOut
)
{
    int id = get_local_id(0);
    if (id == 0) {...}

    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 2; i < size; i++)
    {
        int maxIdx;
        if (id == 0)
        {
            addedValue[0] = array[i];
            {...}
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if (id < noOfValuesAdded[0]) {...}
        else
            barrier(CLK_LOCAL_MEM_FENCE);
        barrier(CLK_LOCAL_MEM_FENCE);
        if (activatedIdx[0] == -2) {...}
        else {...}

        barrier(CLK_LOCAL_MEM_FENCE);
        if (positionToInsert[0] != -1.0f) {...}

        barrier(CLK_LOCAL_MEM_FENCE);
        stimulationsOut[id] = addedValue[0];
        return;
    }

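After some investigation I realized (by checking stimulationsOut) that addedValue[0] has a different value starting at the 33rd instance of the kernel, and yet another value starting at the 65th instance (so the output looks like [123 123 123 ... 123 (33rd element) 66 ... 66 66 66 66 66 .. (65th element) 127 ..... 127 ...]).

__global float *array is READ_ONLY, and I do not change addedValue[0] anywhere except in the first if inside the for loop. What could cause this issue?

My GPU specs: [https://devtalk.nvidia.com/default/topic/521502/gt650m-a-kepler-part-/]

After commenting out the bodies of the following two ifs, the problem does not recur: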

    /*if (activatedIdx[0] == -2)
    {
        if (noOfValuesAdded[0] == 2)
        {
            positionToInsert[0] = 0.99f;
        }
        else if (id != 0 && id != maxIdx
                 && stimulations[id] >= stimulations[(id - 1)]
                 && stimulations[id] >= stimulations[(id + 1)])
        {
            if ((1.0f - (fabs((currentOutput[(id - 1)] - currentOutput[id])) / range[0])) < stimulations[(id - 1)])
                positionToInsert[0] = (float)id - 0.01f;
            else
                positionToInsert[0] = (float)id + 0.99f;
        }
    }*/

and

    if (positionToInsert[0] != -1.0f) 
    {
        float temp = 0.0f;
        /*if ((float)id>positionToInsert[0]) 
        {
            temp = currentOutput[id];
            barrier(CLK_LOCAL_MEM_FENCE);
            currentOutput[id + 1] = temp;
        }
        else 
        {
            barrier(CLK_LOCAL_MEM_FENCE);
        }*/
        barrier(CLK_LOCAL_MEM_FENCE);

        if (id == round(positionToInsert[0])) 
        {
            currentOutput[id] = addedValue[0];
            noOfValuesAdded[0] = noOfValuesAdded[0] + 1;
        }
    }

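Update: after fixing the barriers, the algorithm works correctly until the size exceeds 768 (which, oddly, is twice the number of cores on my GPU). I expected it to work for up to 1024 elements, which is the maximum work-group size. Am I missing something?

【Comments】:

I posted a truncated version of the code, altered for investigation, to make it easier to answer, because the full version is not so easy to read.

Tags: opencl gpu nvidia gpgpu

【Solution 1】:

All work-items in a warp execute the same instruction in lock-step. The warp size on Nvidia is 32 work-items. If the kernel works correctly for up to 32 work-items, that suggests something is wrong with the barriers.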
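
The warp boundaries line up with the symptom described in the question: with 32 work-items per warp, local IDs 0 to 31 fall in the first warp, 32 to 63 in the second, and 64 onward in the third, which is exactly where the 33rd and 65th output elements change value. A minimal sketch of that arithmetic in plain C follows. The helper name warp_of and the constant WARP_SIZE are made up for illustration, and the mapping assumes a one-dimensional work-group whose IDs are assigned to warps in order.

```c
#include <assert.h>

/* Assumption: 32 work-items per warp, as stated above for NVIDIA hardware. */
#define WARP_SIZE 32

/* Hypothetical helper: index of the warp that a local work-item ID belongs to,
   assuming a one-dimensional work-group with IDs assigned to warps in order. */
static int warp_of(int local_id)
{
    assert(local_id >= 0);
    return local_id / WARP_SIZE;
}
```

Work-items that share a warp index move together, so a misplaced barrier only becomes visible once a second warp exists, that is, once the size exceeds 32.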

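The documentation for barrier says:

All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier.

I can see that this is a problem in your kernel. For example here: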

if ((float)id>positionToInsert[0]) 
{
    temp = currentOutput[id];
    barrier(CLK_LOCAL_MEM_FENCE); // <---- some work items may meet here
    currentOutput[id + 1] = temp;
}
else 
{
    barrier(CLK_LOCAL_MEM_FENCE); // <---- other work items may meet here
}

You can fix it like this:

if ((float)id>positionToInsert[0]) 
    temp = currentOutput[id];
barrier(CLK_LOCAL_MEM_FENCE); // <---- here all work items meet at the same barrier
if ((float)id>positionToInsert[0]) 
    currentOutput[id + 1] = temp;
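
To see why the read and the write have to be separated by one barrier that every work-item reaches, here is a rough host-side emulation of the fixed fragment in plain C. It is only a sketch: the function name shift_right_two_phase, the MAX_ITEMS bound, and the id + 1 < n guard are assumptions added for illustration and are not part of the code above. Each loop over id stands in for all work-items executing one phase.

```c
#include <assert.h>

#define MAX_ITEMS 1024  /* assumed upper bound on the work-group size */

/* Hypothetical host-side emulation of the fixed fragment. Each loop over id
   stands for all work-items executing one phase; the point between the two
   loops is where the single barrier(CLK_LOCAL_MEM_FENCE) sits in the kernel. */
static void shift_right_two_phase(float *currentOutput, int n, float positionToInsert)
{
    float temp[MAX_ITEMS];  /* one private temp per emulated work-item */
    assert(n >= 0 && n <= MAX_ITEMS);

    /* Phase 1: work-items to the right of the insert position read their element. */
    for (int id = 0; id < n; id++)
        if ((float)id > positionToInsert)
            temp[id] = currentOutput[id];

    /* barrier: every work-item arrives here, whether or not it took the branch. */

    /* Phase 2: the same work-items store one slot to the right.
       The id + 1 < n guard is an addition, not part of the answer's code;
       it keeps the last work-item from writing past the end of the array. */
    for (int id = 0; id < n; id++)
        if ((float)id > positionToInsert && id + 1 < n)
            currentOutput[id + 1] = temp[id];
}
```

For example, with currentOutput = {10, 20, 30, 40, 50, 60} and positionToInsert = 1.99f, the array becomes {10, 20, 30, 30, 40, 50}, so slot 2 is free to receive the added value. All reads complete before any write begins, which is the ordering the single shared barrier provides in the kernel.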

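【Comments】:

Thanks. When the project started, it was written with the Java framework Aparapi, whose documentation claims that only the number of calls to localBarrier must be the same for each kernel (github.com/aparapi/aparapi/blob/master/doc/UsingLocalMemory.md)

【Solution 2】:

After fixing the barriers, the algorithm works correctly until the size exceeds 768 (which, oddly, is twice the number of cores on my GPU). I expected it to work for up to 1024 elements, which is the maximum work-group size. Am I missing something?

【Comments】:

CL_KERNEL_WORK_GROUP_SIZE returns 1024
Can you run it with nvprof --print-gpu-trace and check the Block and Grid dimensions?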