CUDA function pointers: answers

【Question title】: CUDA function pointers
【Posted】: 2013-03-16 16:08:59
【Question】:

I am trying to do something like this in CUDA (what I actually need is to write some numerical integration functions):

#include <iostream>
using namespace std;

float f1(float x) {
    return x * x;
}

float f2(float x) {
    return x;
}

void tabulate(float p_f(float)) {
    for (int i = 0; i != 10; ++i) {
        std::cout << p_f(i) << ' ';
    }
    std::cout << std::endl;
}

int main() {
    tabulate(f1);
    tabulate(f2);
    return 0;
}

Output:

0 1 4 9 16 25 36 49 64 81
0 1 2 3 4 5 6 7 8 9

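I tried the following, but all I get is this error:

Error: Function pointers and function template parameters are not supported in sm_1x.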

float f1(float x) {
    return x;
}

__global__ void tabulate(float lower, float upper, float p_function(float), float* result) {
    for (lower; lower < upper; lower++) {
        *result = *result + p_function(lower);
    }
}

int main() {
    float res;
    float* dev_res;

    cudaMalloc( (void**)&dev_res, sizeof(float) ) ;

    tabulate<<<1,1>>>(0.0, 5.0, f1, dev_res);
    cudaMemcpy(&res, dev_res, sizeof(float), cudaMemcpyDeviceToHost);

    printf("%f\n", res);
    /************************************************************************/
    scanf("%s");

    return 0;
}

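【Question comments】:

What card are you using? You appear to be compiling your code for compute capability 1.x, and I believe function pointers are a compute capability 2.x feature. You can change your nvcc invocation to -gencode arch=compute_20,code=sm_20 (if your card supports it).
@alrikai GeForce 560Ti
Then you should change your compilation target from 1.x to 2.x, which will get rid of the compile error. You may still run into some runtime problems, though...
@alrikai Oh, OK, but is there any way to do something like this on 1.x?
I don't think so. It looks like you need a function pointer to a device function, and according to the CUDA Programming Guide: "Function pointers to __device__ functions are only supported in device code compiled for devices of compute capability 2.x and higher." Your 560Ti is compute capability 2.1, so this should work for you if you compile with -gencode arch=compute_20,code=sm_20.

Tags: cuda function-pointers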

【Solution 1】:

Here is a simple function pointer class that can be called from a kernel. I wrote it based on this question:

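#include <cstdlib>
#include <cuda_runtime.h>

// Holds a device-memory copy of a device function pointer so it can be passed to a kernel.
template <typename T>
struct cudaCallableFunctionPointer
{
public:
  // f_ is the address of a __device__ variable that stores the function pointer.
  cudaCallableFunctionPointer(T* f_)
  {
    T* host_ptr = (T*)malloc(sizeof(T));
    cudaMalloc((void**)&ptr, sizeof(T));

    // Read the pointer value out of the device symbol, then store it in device memory.
    cudaMemcpyFromSymbol(host_ptr, *f_, sizeof(T));
    cudaMemcpy(ptr, host_ptr, sizeof(T), cudaMemcpyHostToDevice);

    free(host_ptr);  // host_ptr came from malloc, so release it with free, not cudaFree
  }

  ~cudaCallableFunctionPointer()
  {
    cudaFree(ptr);
  }

  T* ptr;  // device memory holding the function pointer
};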

You can use it like this:

#include <cstdio>

__device__ double func1(double x)
{
    return x + 1.0;
}

typedef double (*func)(double x);
// Device-side variable holding the address of func1.
__device__ func f_ = func1;

__global__ void test_kernel(func* f)
{
    double x = (*f)(2.0);
    printf("%g\n", x);
}

int main()
{
    cudaCallableFunctionPointer<func> f(&f_);

    test_kernel<<<1, 1>>>(f.ptr);
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
}

Output:

3

【Comments】:

Isn't there a memory leak on host_ptr? You never call free. Why use malloc instead of just putting the object on the stack?
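
The stack-based variant suggested in this comment could be sketched roughly as follows. This is a minimal sketch and not the answer author's code: the names StackStagedFunctionPointer, call_kernel, and call_on_device are hypothetical, the symbol is passed by reference instead of by pointer, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef double (*func)(double);

__device__ double func1(double x) { return x + 1.0; }
// Device-side variable holding the address of func1.
__device__ func f_ = func1;

// Hypothetical variant of the wrapper: the host staging copy lives on the
// stack, so there is no malloc/free pair to get wrong.
template <typename T>
struct StackStagedFunctionPointer
{
    explicit StackStagedFunctionPointer(const T& symbol) : ptr(nullptr)
    {
        T host_copy = T();  // staging copy on the stack
        cudaMemcpyFromSymbol(&host_copy, symbol, sizeof(T));
        cudaMalloc((void**)&ptr, sizeof(T));
        cudaMemcpy(ptr, &host_copy, sizeof(T), cudaMemcpyHostToDevice);
    }
    ~StackStagedFunctionPointer() { cudaFree(ptr); }

    // Non-copyable, so the device allocation is never freed twice.
    StackStagedFunctionPointer(const StackStagedFunctionPointer&) = delete;
    StackStagedFunctionPointer& operator=(const StackStagedFunctionPointer&) = delete;

    T* ptr;  // device memory holding the function pointer
};

// Calls the pointed-to device function and stores the result in *out.
__global__ void call_kernel(const func* f, double x, double* out)
{
    *out = (*f)(x);
}

// Hypothetical helper: evaluates func1(x) on the device via the wrapper.
static double call_on_device(double x)
{
    StackStagedFunctionPointer<func> f(f_);
    double* dev_out = nullptr;
    double out = 0.0;
    cudaMalloc((void**)&dev_out, sizeof(double));
    call_kernel<<<1, 1>>>(f.ptr, x, dev_out);
    cudaMemcpy(&out, dev_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return out;
}

int main()
{
    printf("%g\n", call_on_device(2.0));
    return 0;
}
```

Deleting the copy operations is a design choice of this sketch: without it, copying the wrapper would leave two objects that both call cudaFree on the same pointer.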

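【Solution 2】:

To get rid of the compile error, you have to pass -gencode arch=compute_20,code=sm_20 as a compiler argument when compiling your code. But you may then run into some runtime problems:

Taken from the CUDA Programming Guide http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#functions

Function pointers to __global__ functions are supported in host code, but not in device code. Function pointers to __device__ functions are only supported in device code compiled for devices of compute capability 2.x and higher.

It is not allowed to take the address of a __device__ function in host code.

So you can have something like this (adapted from the "FunctionPointers" sample):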

//your function pointer type - returns unsigned char, takes parameters of type unsigned char and float
typedef unsigned char(*pointFunction_t)(unsigned char, float);

//some device function to be pointed to
__device__ unsigned char
Threshold(unsigned char in, float thresh)
{
   ...
}

//pComputeThreshold is a device-side function pointer to your __device__ function
__device__ pointFunction_t pComputeThreshold = Threshold;
//the host-side function pointer to your __device__ function
pointFunction_t h_pointFunction;

//in host code: copy the function pointers to their host equivalent
cudaMemcpyFromSymbol(&h_pointFunction, pComputeThreshold, sizeof(pointFunction_t));

You can then pass h_pointFunction as an argument to your kernel, which can use it to call your __device__ function.

//your kernel taking your __device__ function pointer as a parameter
__global__ void kernel(pointFunction_t pPointOperation)
{
    unsigned char tmp;
    ...
    tmp = (*pPointOperation)(tmp, 150.0);
    ...
}

//invoke the kernel in host code, passing in your host-side __device__ function pointer
kernel<<<...>>>(h_pointFunction);

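Hope this makes some sense. In short, it looks like you have to change your f1 function into a __device__ function and follow a similar procedure (the typedefs aren't required, but they do make the code nicer) to obtain a valid function pointer on the host side that you can pass to your kernel. I'd also recommend having a look at the FunctionPointers CUDA sample.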
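
Applied to the tabulate kernel from the question, this procedure might look like the minimal sketch below. The names unary_fn, d_f1, d_f2, and sum_on_device are hypothetical and do not appear in the thread, the kernel accumulates into a local variable instead of an uninitialized *result, and error checking is omitted for brevity. It assumes a device and build target of compute capability 2.x or higher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pointer type for float -> float device functions (hypothetical name).
typedef float (*unary_fn)(float);

__device__ float f1(float x) { return x * x; }
__device__ float f2(float x) { return x; }

// Device-side pointer variables. They are initialised in device code,
// where taking the address of a __device__ function is allowed.
__device__ unary_fn d_f1 = f1;
__device__ unary_fn d_f2 = f2;

// Sums fn(x) for x = lower, lower + 1, ... while x < upper, on one thread.
__global__ void tabulate(float lower, float upper, unary_fn fn, float* result)
{
    float sum = 0.0f;
    for (float x = lower; x < upper; x += 1.0f) {
        sum += fn(x);
    }
    *result = sum;
}

// Hypothetical helper: copies the pointer value out of the device symbol,
// passes it to the kernel as an ordinary argument, and returns the sum.
static float sum_on_device(const unary_fn& symbol)
{
    unary_fn h_fn = nullptr;
    cudaMemcpyFromSymbol(&h_fn, symbol, sizeof(unary_fn));

    float* dev_res = nullptr;
    float res = 0.0f;
    cudaMalloc((void**)&dev_res, sizeof(float));
    tabulate<<<1, 1>>>(0.0f, 5.0f, h_fn, dev_res);
    cudaMemcpy(&res, dev_res, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_res);
    return res;
}

int main()
{
    printf("%f\n", sum_on_device(d_f1));  // sum of x * x for x = 0..4
    printf("%f\n", sum_on_device(d_f2));  // sum of x for x = 0..4
    return 0;
}
```

The host never calls or dereferences h_fn. It only carries the device address from the symbol to the kernel launch.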

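【Comments】:

In addition to the answer above (+1), you may be interested in the very simple worked example of how to use function pointers in device code (but without templates) in this thread on the NVIDIA forums: devtalk.nvidia.com/default/topic/457094/how-can-i-use-__device__-function-pointer-in-cuda-/
@njuffa Nice! Your example is cleaner (and more complete).
@njuffa In alrikai's answer, the device function pointer can be accessed directly in the kernel. What is the point of creating a host function pointer, copying from the symbol, and then passing it as a kernel argument?
@zindarod Not sure what you mean. In the example code I pointed to in my post above from 2013/5/27, the function pointers are located on the device: __device__ op_func func[3] = { add_func, mul_func, div_func };
@njuffa Yes, your example makes perfect sense. But if you look at the answer above, what is the point of h_pointFunction when pComputeThreshold can be called directly in the kernel?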
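
The device-resident table quoted in the comments above can be used without any host-side copy, roughly as in this minimal sketch. Only the declaration of func and the names op_func, add_func, mul_func, and div_func come from the comment. Their signatures and bodies, the apply_ops kernel, and the run_ops helper are assumptions made for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed signature: binary operation on two floats.
typedef float (*op_func)(float, float);

__device__ float add_func(float a, float b) { return a + b; }
__device__ float mul_func(float a, float b) { return a * b; }
__device__ float div_func(float a, float b) { return a / b; }

// The table lives in device memory and holds device addresses, so kernels
// can index it directly and the host never sees a function pointer.
__device__ op_func func[3] = { add_func, mul_func, div_func };

// Thread i applies operation i to (a, b) and stores the result in out[i].
__global__ void apply_ops(float a, float b, float* out)
{
    int i = threadIdx.x;
    out[i] = func[i](a, b);
}

// Hypothetical helper: runs all three operations and copies the results back.
static void run_ops(float a, float b, float out[3])
{
    float* dev_out = nullptr;
    cudaMalloc((void**)&dev_out, 3 * sizeof(float));
    apply_ops<<<1, 3>>>(a, b, dev_out);
    cudaMemcpy(out, dev_out, 3 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
}

int main()
{
    float out[3] = { 0.0f, 0.0f, 0.0f };
    run_ops(6.0f, 3.0f, out);  // 6 + 3, 6 * 3, 6 / 3
    printf("%g %g %g\n", out[0], out[1], out[2]);
    return 0;
}
```

Here the operation is chosen inside the kernel by index. Passing the pointer as a kernel argument, as in the answer above, instead lets the host decide which function a given launch uses.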

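【Solution 3】:

Even if you can get this code to compile (see @Robert Crovella's answer), it will not work. You cannot pass a function pointer from host code, because the host compiler has no way of determining the function's address.

【Comments】: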