【Question Title】: How to transpose a huge arbitrary matrix in CUDA using shared memory?
【Posted】: 2019-10-05 02:45:06
【Question Description】:

My task is to transpose a matrix in CUDA using shared memory, without bank conflicts. The constraint is: width*height

I tried the solution given here, Matrix Transpose (with shared Memory) with arbitary size on Cuda C, but it did not help me, because my matrix is too large and exceeds the CUDA grid dimension limit (65535 blocks, with 32 threads per block).

I tried to write a loop so the kernel can handle huge matrices:

const int BLOCK_DIM = 32;
__global__ void transposeMatrixFast(double* inputMatrix, double* outputMatrix, int width, int height)
{
    __shared__ double temp[BLOCK_DIM][BLOCK_DIM+1];

    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    int offsetx = gridDim.x * blockDim.x;
    int offsety = gridDim.y * blockDim.y;

    for (int y = yIndex; y < height; y += offsety)
    {
        for (int x = xIndex; x < width; x += offsetx)
        {
            if ((xIndex < width) && (yIndex < height))
            {
                int idx = y * width + x;

                temp[threadIdx.y][threadIdx.x] = inputMatrix[idx];
            }

            __syncthreads();


            if ((x < width) && (y < height))
            {
                int idx = x * height + y;

                outputMatrix[idx] = temp[threadIdx.y][threadIdx.x];

            }
        }
    }
}

Now I get a "time limit exceeded" error on the test server. The reason is that I cannot benefit from shared memory in this line: outputMatrix[idx] = temp[threadIdx.x][threadIdx.y]; so my kernel became slow. I think there must be another way to organize the loops, but I don't know how.

【Question Discussion】:

Tags: c++ matrix cuda transpose


【Solution 1】:

I found another way to organize the loops, and now I can transpose a matrix of arbitrary size:

    const int BLOCK_SIZE = 32;
    __global__ void matrixTransposeSolveBankConflicts(const double *d_a, double *d_b, const unsigned long rows, const unsigned long cols) {
    
        // +1 column of padding so transposed accesses avoid shared-memory bank conflicts
        __shared__ double mat[BLOCK_SIZE][BLOCK_SIZE + 1];
    
        // number of tiles needed in each direction
        unsigned long bh = ceil((double)rows / BLOCK_SIZE);
        unsigned long bw = ceil((double)cols / BLOCK_SIZE);
    
        // block-stride loops: a grid of any size covers all tiles
        for (unsigned long blocky = blockIdx.y; blocky < bh; blocky += gridDim.y) {
            for (unsigned long blockx = blockIdx.x; blockx < bw; blockx += gridDim.x) {
                unsigned long bx = blockx * BLOCK_SIZE;
                unsigned long by = blocky * BLOCK_SIZE;
    
                unsigned long i = by + threadIdx.y;
                unsigned long j = bx + threadIdx.x;
    
                // coalesced read from d_a, transposed store into the tile
                if (i < rows && j < cols)
                {
                    mat[threadIdx.x][threadIdx.y] = d_a[i*cols + j];
                }
    
                __syncthreads();
    
                unsigned long ti = bx + threadIdx.y;
                unsigned long tj = by + threadIdx.x;
    
                // coalesced write to d_b from the transposed tile
                if (tj < rows && ti < cols)
                {
                    d_b[ti*rows + tj] = mat[threadIdx.y][threadIdx.x];
                }
    
                // keep the tile intact until all threads have written it out
                __syncthreads();
            }
        }
    }
    
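    The tile indexing above is subtle enough to be worth checking outside CUDA. The following is a CPU re-implementation of the same index math (a verification sketch, not the answer's code; the function name `tiled_transpose` is made up), using the same 32x33 padded tile, load phase, and store phase, tested on sizes that are not multiples of 32:

    ```cpp
    #include <cassert>
    #include <vector>

    // CPU mirror of the kernel's tile logic (sketch for verification only):
    // same padded tile, same load/store indices as the kernel above.
    static std::vector<double> tiled_transpose(const std::vector<double>& a,
                                               unsigned long rows, unsigned long cols) {
        const int B = 32;
        std::vector<double> b(rows * cols, 0.0);
        double tile[32][33];                       // +1 column, as in the kernel
        unsigned long bh = (rows + B - 1) / B, bw = (cols + B - 1) / B;
        for (unsigned long blocky = 0; blocky < bh; ++blocky)
            for (unsigned long blockx = 0; blockx < bw; ++blockx) {
                // load phase: tile[tx][ty] = a[i*cols + j]
                for (int ty = 0; ty < B; ++ty)
                    for (int tx = 0; tx < B; ++tx) {
                        unsigned long i = blocky * B + ty, j = blockx * B + tx;
                        if (i < rows && j < cols) tile[tx][ty] = a[i * cols + j];
                    }
                // store phase: b[ti*rows + tj] = tile[ty][tx]
                for (int ty = 0; ty < B; ++ty)
                    for (int tx = 0; tx < B; ++tx) {
                        unsigned long ti = blockx * B + ty, tj = blocky * B + tx;
                        if (tj < rows && ti < cols) b[ti * rows + tj] = tile[ty][tx];
                    }
            }
        return b;
    }

    int main() {
        // deliberately not multiples of 32, to exercise the boundary guards
        const unsigned long rows = 70, cols = 45;
        std::vector<double> a(rows * cols);
        for (unsigned long k = 0; k < a.size(); ++k) a[k] = (double)k;
        std::vector<double> b = tiled_transpose(a, rows, cols);
        for (unsigned long i = 0; i < rows; ++i)
            for (unsigned long j = 0; j < cols; ++j)
                assert(b[j * rows + i] == a[i * cols + j]);
        return 0;
    }
    ```

    If this CPU version produces a correct transpose for odd sizes, any remaining GPU problem is in the launch configuration rather than the index math.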
    

    【Discussion】:
