Mex CUDA dynamic allocation / slow mex code: answers

【Question title】: Mex Cuda Dynamic Allocation / Slow mex code
【Posted】: 2014-09-12 14:46:16
【Question】:

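I have CUDA/C++ code that returns C++ host-side arrays. I want to manipulate these arrays in MATLAB, so I rewrote my code in MEX format and compiled it with mex.

I got it to work by passing preallocated arrays from MATLAB into the mex script, but this slowed things down drastically (54 s with mex vs 14 s without mex).

Here is the slow solution, as a simplified version of my code with no inputs and one output: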

#include "mex.h"
#include "gpu/mxGPUArray.h"
#include "matrix.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "curand.h"
#include <cuda_runtime.h>
#include "math.h"
#include <curand_kernel.h>
#include <time.h>
#include <algorithm>
#include <iostream>

#define iterations 159744
#define transMatrixSize 2592 // Just for clarity. Do not change. No need to adjust this value for this simulation.
#define reps 1024 // Is equal to blocksize. Do not change without proper source code adjustments.
#define integralStep 13125  // Number of time steps to be averaged at the tail of the Force-Time curves to get Steady State Force

__global__ void kern(float *masterForces, ...)
{

int globalIdx = ((blockIdx.x + (blockIdx.y * gridDim.x)) * (blockDim.x * blockDim.y)) + (threadIdx.x + (threadIdx.y * blockDim.x));
...

  ...
   {
...
      {
          masterForces[i] = buffer[0]/24576.0;
      }

      }
   }
...
}



}


void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{
   ...

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);


//Device input vectors
float *d_F0;

..
// Allocate memory for each vector on GPU
cudaMalloc((void**)&d_F0, iterations * sizeof(float));
...




//////////////////////////////////////////////LAUNCH ////////////////////////////////////////////////////////////////////////////////////

kern<<<1, 1024>>>( d_F0);



//////////////////////////////////////////////RETRIEVE DATA ////////////////////////////////////////////////////////////////////////////////////


cudaMemcpyAsync( h_F0 , d_F0 , iterations * sizeof(float), cudaMemcpyDeviceToHost);



///////////////////Free Memory///////////////////


cudaDeviceReset();
////////////////////////////////////////////////////

}

Why is it so slow?

Edit: mex was compiling for an old architecture (SM_13) instead of SM_35. The timings now make sense (16 s with mex, 14 s with plain C++/CUDA).
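
For readers who hit the same mismatch, here is a minimal sketch (not part of the original post) that prints the compute capability of the installed GPU, so it can be compared with the architecture passed to nvcc through its `-arch` flag (for example `-arch=sm_35`). It assumes device 0 is the GPU in question and is built as a standalone program with nvcc. How the flag reaches nvcc in a given mex setup is not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int device = 0;  // assumption: the GPU of interest is device 0
    cudaDeviceProp prop;

    // Query the static properties of the device, including its architecture.
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // major.minor is the compute capability, e.g. 3.5 corresponds to sm_35.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```

If the reported capability is higher than the architecture the MEX file was built for, the kernel may be running code generated for a much older target, which is consistent with the slowdown described above.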

【Question comments】:

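Which MathWorks example are you referring to?
The standard CUDA mex example "timestwo": mathworks.com/help/distcomp/…
That example takes a gpuArray input and returns a gpuArray output. You want regular arrays in and out, right?
See my updated answer. Also, remove delete h_F0; when using mxCreateNumericMatrix.
I really don't see anything left to clean up. Make sure you are not timing the first run.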

Tags: c++ matlab dynamic cuda mex

【Solution 1】:

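There is no need to use mxGPUArray if the output of the CUDA code is a plain old data (POD) host-side (as opposed to device-side) array, such as your Forces1 array of floats created with new. The MathWorks example you referenced probably demonstrates MATLAB's gpuArray and built-in CUDA functionality, not how to pass data into and out of regular CUDA functions from within a MEX function.

If you can initialize Forces1 (h_F0 in the full code) outside the CUDA function, for example in mexFunction, then the solution is simply to switch from new to one of the mxCreate* functions (mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateNumericMatrix, etc.) and then pass the data pointer to your CUDA function: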

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);
// myCudaWrapper(..., h_F0, ...);  i.e. cudaMemcpyAsync(h_F0, d_F0, ...)

So the only changes to your code are:

Replace:

float *h_F0 = new float[(iterations)];

with:

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);

Remove:

delete h_F0;
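
Put together, the two changes above could look like the following sketch. It is an illustration under stated assumptions, not the asker's code: `fillKernel` is a hypothetical stand-in for the elided `kern`, `N_OUT` mirrors the `iterations` constant from the question, and the launch configuration (156 blocks of 1024 threads) is chosen only so that every element gets written. A blocking `cudaMemcpy` is used here in place of `cudaMemcpyAsync` to keep the ordering obvious.

```cuda
#include "mex.h"
#include <cuda_runtime.h>

#define N_OUT 159744  // same value as `iterations` in the question

// Hypothetical stand-in kernel: the real kernel body is elided in the question.
__global__ void fillKernel(float *out, int n)
{
    // Grid-stride loop so any launch configuration covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        out[i] = static_cast<float>(i);
    }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    // MATLAB-owned output buffer: no new[], and therefore no delete.
    plhs[0] = mxCreateNumericMatrix(N_OUT, 1, mxSINGLE_CLASS, mxREAL);
    float *h_F0 = static_cast<float *>(mxGetData(plhs[0]));

    float *d_F0 = NULL;
    cudaError_t err = cudaMalloc((void **)&d_F0, N_OUT * sizeof(float));
    if (err != cudaSuccess)
        mexErrMsgIdAndTxt("sketch:cudaMalloc", "%s", cudaGetErrorString(err));

    fillKernel<<<156, 1024>>>(d_F0, N_OUT);

    // Blocking copy: returns once the kernel is done and the data is in h_F0.
    err = cudaMemcpy(h_F0, d_F0, N_OUT * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_F0);
    if (err != cudaSuccess)
        mexErrMsgIdAndTxt("sketch:cudaMemcpy", "%s", cudaGetErrorString(err));
}
```

Because the device-to-host copy writes straight into the buffer behind `plhs[0]`, MATLAB receives the result without any extra host-side copy, and the buffer is released by MATLAB's memory manager rather than by the MEX code.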

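Note: if your CUDA code owns the output host-side array, then you must copy the data into an mxArray. Unless you allocate the mexFunction outputs with the mx API, any data buffer you assign (for example with mxSetData) is not handled by the MATLAB memory manager, and you will get a segfault or, at best, a memory leak.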
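
The copy described in the note can be sketched as follows. `runAndOwnOutput` is a hypothetical stand-in for existing code that allocates its own host buffer with `new[]` and cannot easily be changed. The `cudaMemset` call is only a placeholder for the real kernel, and CUDA error checks are omitted for brevity.

```cuda
#include "mex.h"
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in for code that allocates and owns its host output.
static float *runAndOwnOutput(size_t n)
{
    float *h_out = new float[n];
    float *d_out = NULL;
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_out, 0, n * sizeof(float));  // placeholder for the real kernel
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const size_t n = 159744;  // same value as `iterations` in the question

    // Buffer owned by the CUDA/C++ code, not by MATLAB.
    float *owned = runAndOwnOutput(n);

    // Allocate the output through the mx API, then copy into it.
    plhs[0] = mxCreateNumericMatrix(n, 1, mxSINGLE_CLASS, mxREAL);
    std::memcpy(mxGetData(plhs[0]), owned, n * sizeof(float));

    // The code that allocated with new[] releases with delete[].
    delete[] owned;
}
```

The extra `memcpy` is the price of keeping ownership on the C++ side. When the buffer can instead be created with an mxCreate* function up front, that copy disappears.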

【Discussion】:

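If I initialize Forces1 in MATLAB and pass it to the mex function, will I take a performance hit compared with initializing Forces1 in the CUDA/C++ script?
No. When you pass the float* to your "CUDA/C++ function", it is just a regular buffer. The mxArray is just a container.
I am trying your solution.
Presumably you have a CUDA (C) function with a C++ wrapper that copies the output device array into the output host array? If so, that is the only operation on the MATLAB-owned output array Forces1.
Yes. It creates an array on the host, runs a kernel, and then copies the kernel's result into the host array.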