Is it possible to dedicate a portion of shared memory to each thread? Answers

【Question title】: Is it possible to dedicate a portion of shared memory to each thread?
【Posted】: 2020-02-11 07:31:54
【Question】:

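I am using CUDA 10.1 on Ubuntu 18.04 and I was wondering whether it is possible to dedicate a portion of shared memory to each thread. Say I want each thread to have a relatively large array that does not fit in registers. Putting the array into shared memory introduces race conditions, because every thread in the block can access it. So it would be very interesting for me to use a portion of shared memory as "registers", so that I can avoid race conditions. I know this will limit my occupancy by limiting the number of threads I can run on an SM, but in my case the trade-off is worth it. Any help is appreciated!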

【Comments】:

Tags: multithreading cuda gpgpu

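【Solution 1】:

No, it is not possible to dedicate a shared memory allocation (static or dynamic) to a particular thread. Shared memory has block scope only.

It is, however, entirely possible to design an indexing scheme in which each thread in the block exclusively uses its own unique locations within a block-scope shared memory allocation. For example: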

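template<int nthreads, int words_per_thread>
__global__
void kernel(/* kernel arguments */)
{
    // One block-scope allocation, sized for every thread in the block.
    // The kernel must be launched with blockDim.x <= nthreads.
    __shared__ int buffer[nthreads * words_per_thread];

    // Each thread takes its own contiguous slice of the allocation.
    int* localbuff = &buffer[threadIdx.x * words_per_thread];

    // localbuff is now safely indexed from [0] to [words_per_thread-1]

}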
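
To make the fragment above concrete, here is a minimal runnable sketch of the same sliced layout. The kernel name `slice_kernel`, the `out` argument, the fill pattern, and the sizes (64 threads, 16 words each) are illustrative assumptions and not part of the original answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread owns the contiguous slice
// buffer[threadIdx.x * words_per_thread ... + words_per_thread - 1].
template <int nthreads, int words_per_thread>
__global__ void slice_kernel(int* out)
{
    __shared__ int buffer[nthreads * words_per_thread];
    int* localbuff = &buffer[threadIdx.x * words_per_thread];

    // Fill the private slice. No other thread reads or writes these
    // words, so no __syncthreads() is required.
    for (int i = 0; i < words_per_thread; ++i)
        localbuff[i] = threadIdx.x + i;

    // Read the slice back and reduce it to one value per thread.
    int sum = 0;
    for (int i = 0; i < words_per_thread; ++i)
        sum += localbuff[i];

    out[threadIdx.x] = sum;
}

int main()
{
    // Illustrative sizes: 64 threads * 16 words * 4 bytes = 4 KiB of shared memory.
    constexpr int nthreads = 64, words = 16;

    int* d_out = nullptr;
    cudaMalloc(&d_out, nthreads * sizeof(int));

    // The block size must match the nthreads template argument.
    slice_kernel<nthreads, words><<<1, nthreads>>>(d_out);

    int h_out[nthreads];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Thread t stored t+0, t+1, ..., t+words-1, so its sum is
    // words*t + words*(words-1)/2.
    bool ok = true;
    for (int t = 0; t < nthreads; ++t)
        ok = ok && (h_out[t] == words * t + words * (words - 1) / 2);
    printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

The per-thread footprint here is `words_per_thread * sizeof(int)` bytes, so the block's shared memory use grows linearly with the block size, which is the occupancy trade-off mentioned in the question.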

Another possible approach is something like this:

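#include <stdio.h>

// Thin wrapper that makes a block-scope shared array look like a
// per-thread array. Element idx of the calling thread is stored at
// sp[blockDim.x*idx + threadIdx.x], so threads never overlap.
template<typename T>
class sr_
{
  T *sp;
  public:
  __device__
  sr_(T *_sd) { sp = _sd;}
  __device__
  T &operator[](int idx){return sp[blockDim.x*idx+threadIdx.x];}
};
// for statically allocated shared memory:
// size = elements per thread, block_size = threads per block
#define SREG(size,type,block_size) \
  __shared__ type SREG_sdata[(size)*(block_size)]; \
  typedef type SREG_type; \
  sr_<SREG_type> sreg(SREG_sdata);
// for dynamically allocated shared memory:
// the array must be declared extern, and its size in bytes is passed
// as the third kernel launch parameter
#define DSREG(type) \
  extern __shared__ type SREG_sdata[]; \
  typedef type SREG_type; \
  sr_<SREG_type> sreg(SREG_sdata);

const int BS = 8;

__global__ void k2(){
  SREG(8,float,BS)        // 8 floats per thread, BS threads per block
  sreg[0] = 1.0f;         // writes this thread's element 0 only
  printf("%f\n", sreg[0]);
}

int main(){

  k2<<<1,BS>>>();
  cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed
}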
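
The same strided layout also works with dynamically allocated shared memory, where the per-thread size is only known at launch time. The following is a sketch under stated assumptions: the kernel name `dyn_kernel`, its parameters, and the sizes (128 threads, 8 floats each) are made up for illustration and do not come from the original answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: element i of the calling thread lives at
// sdata[blockDim.x * i + threadIdx.x], the same mapping sr_ uses.
__global__ void dyn_kernel(int words_per_thread, float* out)
{
    // Dynamic shared memory: unsized extern array, the byte count
    // comes from the third launch parameter.
    extern __shared__ float sdata[];

    for (int i = 0; i < words_per_thread; ++i)
        sdata[blockDim.x * i + threadIdx.x] = 1.0f;

    float sum = 0.0f;
    for (int i = 0; i < words_per_thread; ++i)
        sum += sdata[blockDim.x * i + threadIdx.x];

    out[threadIdx.x] = sum;
}

int main()
{
    // Illustrative sizes: 128 threads * 8 floats * 4 bytes = 4 KiB.
    const int block = 128, words = 8;
    const size_t smem_bytes = block * words * sizeof(float);

    float* d_out = nullptr;
    cudaMalloc(&d_out, block * sizeof(float));

    dyn_kernel<<<1, block, smem_bytes>>>(words, d_out);

    float h_out[block];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every thread summed `words` copies of 1.0f from its own slots.
    bool ok = true;
    for (int t = 0; t < block; ++t)
        ok = ok && (h_out[t] == static_cast<float>(words));
    printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

Because the size is supplied at launch, the host code is responsible for requesting at least `blockDim.x * words_per_thread * sizeof(float)` bytes; the kernel cannot check this itself.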

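The strided layout used by sr_ has the benefit that it is not possible for one thread to index into another thread's space, and there won't be any bank conflicts, because the threads of a warp access consecutive elements. Note that this does not handle every use case. For example, it would have to be modified if multiple kernels in the same module use different data types for shared memory.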
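
The bank-conflict difference between the two layouts can be checked with simple index arithmetic. The sketch below is host-only code (it compiles with nvcc or any C++ compiler) and assumes 32 banks of 4-byte words and a warp of 32 threads; the function names and the sizes (8 words per thread, 64 threads per block) are illustrative assumptions.

```cuda
#include <cstdio>

// Assumptions: 32 shared memory banks, one 4-byte word per bank slot,
// 32 threads per warp.
const int kBanks = 32;
const int kWarp  = 32;
const int kWords = 8;   // illustrative words per thread
const int kBlock = 64;  // illustrative threads per block

// Contiguous slice per thread (first approach).
int sliced_index(int t, int i)  { return t * kWords + i; }

// Element i of every thread stored side by side (sr_ approach).
int strided_index(int t, int i) { return kBlock * i + t; }

// When all threads of one warp access element `elem`, count how many
// of them land in the most heavily used bank. Each thread touches a
// distinct word, so this count is the degree of the bank conflict.
int max_conflict(int (*index_of)(int, int), int elem)
{
    int hits[kBanks] = {0};
    for (int t = 0; t < kWarp; ++t)
        ++hits[index_of(t, elem) % kBanks];

    int worst = 0;
    for (int b = 0; b < kBanks; ++b)
        if (hits[b] > worst) worst = hits[b];
    return worst;
}

int main()
{
    printf("sliced:  %d-way\n", max_conflict(sliced_index, 0));
    printf("strided: %d-way\n", max_conflict(strided_index, 0));
    return 0;
}
```

With these sizes the sliced layout maps the 32 threads of a warp onto only 4 banks (an 8-way conflict), while the strided layout gives each thread its own bank. In the sliced layout, an odd number of words per thread would also spread a warp across all 32 banks, since the stride is then coprime with 32.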

【Comments】: