Namespaces as template parameters in CUDA

【Question title】: Namespaces as template parameters in CUDA
【Posted】: 2015-08-10 10:41:17
【Question】:

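In C++ it is not possible to pass a namespace as any kind of parameter (a template parameter or an actual function argument) to a class or a function. The same applies to CUDA (at least as far as I know). This question explains some of the reasons: Why can't namespaces be template parameters?

Here is an example use case: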

namespace experiment1
{
    int repetitions() { return 2; }
    void setup() { ... }
    void f() { ... }
    void teardown() { ... }
}

namespace experiment2
{
    int repetitions() { return 4; }
    void setup() { ... }
    void f() { ... }
    void teardown() { ... }
}

// Beware, this is invalid C++ and invalid CUDA
template<namespace NS>
void do_test()
{
    // Do something with NS::repetitions(), NS::setup(), ...
}

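One reason this is not valid C++ is that there is nothing in this approach that you could not do with classes. You can indeed turn each namespace into a class and the functions into member functions, then either pass the class as a template parameter to do_test or pass an instance of it as an argument to the same function (probably using static functions in the former case and virtual functions in the latter).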
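
The class-based alternative described above can be sketched in plain host C++. This is only a minimal illustration under stated assumptions: the struct names Experiment1 and Experiment2, the name() helper and the log parameter are hypothetical stand-ins for the namespaces in the question, and f is an ordinary host function here rather than a kernel.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical host-only stand-ins for the namespaces in the question.
struct Experiment1 {
    static int repetitions() { return 2; }
    static std::string name() { return "experiment1"; }
    // f records one run in the log instead of doing real work.
    static void f(std::vector<std::string>& log) { log.push_back(name()); }
};

struct Experiment2 {
    static int repetitions() { return 4; }
    static std::string name() { return "experiment2"; }
    static void f(std::vector<std::string>& log) { log.push_back(name()); }
};

// The class takes the role the namespace could not: E::repetitions() and
// E::f() are resolved at compile time, once per instantiation of do_test.
template <typename E>
std::vector<std::string> do_test()
{
    assert(E::repetitions() > 0);  // every experiment should run at least once
    std::vector<std::string> log;
    for (int i = 0; i < E::repetitions(); ++i) {
        E::f(log);
    }
    return log;
}
```

Calling do_test<Experiment1>() returns a log with one entry per repetition. With static member functions no instance of the experiment is needed, which corresponds to the "pass the class as a template parameter" variant.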

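I agree with that. In the specific case of CUDA, however, there is something you can do with namespaces but not with classes. Imagine that f is a kernel, i.e. a __global__ function, and that setup or another function is used to specify, for example, the size of the shared memory to allocate for the kernel. A kernel cannot be a member of a class (see the answer to this question: Can CUDA kernels be virtual functions?). You can, however, put it in the same namespace as the other functions that belong to the same experiment.

Consider the case shown in the code above: do_test is the function that sets up the timers, prepares some input, checks the output, measures the time and performs a few other operations. Each experiment is a set of functions with the same names and the same interfaces, one of which is a kernel. You want do_test to be generic enough to handle all of these experiments, and you want the code of each experiment to be self-contained in some form of encapsulation (such as a namespace, a struct, a class, ...).

Can this problem be solved?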

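At talonmies' request (many thanks for your comments, by the way), I will make the question more specific.

I have several very simple kernels that perform similar operations. They load values from a large array, apply a stencil operation to them, and write the results to an output array (different from the input array). By stencil operation I mean an operation that thread idx performs on input value idx and its neighbours (for example from idx-3 to idx+3). The simplest of these kernels just copies input to output: each thread reads input[idx] and writes output[idx]. Another example is a difference stencil that performs output[idx] = input[idx+1] - input[idx-1]. (I am leaving out some details, but you get the idea.)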
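
To make the two stencils concrete, here is a minimal host-side sketch of what a result check for each of them might look like. It rests on assumptions that the question does not confirm: the extra size parameter, the names check_copy and check_difference, and the decision to verify only interior elements of the difference stencil (the question leaves boundary handling unspecified).

```cpp
#include <cassert>

// Host check for the copy stencil: output[idx] must equal input[idx].
bool check_copy(const float* input, const float* output, int size)
{
    assert(input != nullptr && output != nullptr);
    for (int idx = 0; idx < size; ++idx) {
        if (output[idx] != input[idx]) return false;
    }
    return true;
}

// Host check for the difference stencil:
// output[idx] must equal input[idx+1] - input[idx-1].
// Assumption: only interior elements are checked, because the first and
// last elements have no complete neighbourhood.
bool check_difference(const float* input, const float* output, int size)
{
    assert(input != nullptr && output != nullptr);
    for (int idx = 1; idx + 1 < size; ++idx) {
        if (output[idx] != input[idx + 1] - input[idx - 1]) return false;
    }
    return true;
}
```

The comparison is exact for simplicity; a stencil that combines more terms would probably need a tolerance instead.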

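I want to benchmark these kernels in order to derive a performance model. For each kernel I also need a host function that can check the results. In each case I also have another kernel that performs the same operation in a slightly different way, with optimizations, but is equivalent from the point of view of the result. Finally, I have a host function that provides the name of the kernel for printing. Here is a summary in code: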

namespace copy
{
    std::string name() { return "copy"; }
    __global__ void kernel(const float* input, float* output, int size);
    __global__ void kernelOptimized(const float* input, float* output, int size);
    bool check(const float* input, const float* output);
}

namespace difference
{
    std::string name() { return "difference"; }
    __global__ void kernel(const float* input, float* output, int size);
    __global__ void kernelOptimized(const float* input, float* output, int size);
    bool check(const float* input, const float* output);
}

I have a function do_test, which I parameterized to make it generic:

typedef std::string NameFunction();
typedef bool CheckFunction(const float* input, const float* output);
typedef void KernelFunction(const float* input, float* output, int size);

// The functions are passed as non-type template parameters, which matches
// the do_test<...>() calls in main below.
template <NameFunction name, KernelFunction kernel1, KernelFunction kernel2, CheckFunction check>
void do_test()
{
    // Set up input and output array
    // Set up CUDA events
    // Warm up kernels
    // Run kernels
    // Check results
    // Measure time
    // Do standard output
}

int main()
{
    do_test<copy::name, copy::kernel, copy::kernelOptimized, copy::check>();
    do_test<difference::name, difference::kernel, difference::kernelOptimized, difference::check>();
    return 0;
}

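This already works well, of course. But if I introduce one more function that every experiment must provide, I will have to modify all the lines where I call do_test. I would prefer to pass the namespace, or some kind of object containing these functions.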

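【Comments】:

Because CUDA kernels cannot be member functions of a class, not even static member functions.
Yes, it can be solved, but not in the way you imagine. CUDA is able to load GPU code at runtime through an API, and PyCUDA has built an entire JIT and metaprogramming ecosystem around this feature. But I am not going to write you a multi-page essay explaining how to do it. Could you explain the specific end goal you are trying to reach here, rather than phrasing the question in very generic programming terms? That might make it easier to answer.

Tags: c++ templates cuda namespaces generic-programming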

【Solution 1】:

You can change the kernels into "just" __device__ functions and call them through a kernel_wrapper:

#include <iostream>
#include <string>
#include <stdio.h>


typedef void (*kernel_ptr)(const float* input, float* output, int size);

template <kernel_ptr kernel>
__global__
void kernel_wrapper(const float* input, float* output, int size)
{
    kernel(input, output, size);
}

struct copy
{
    std::string name() { return "copy"; }
    __device__ static void kernel(const float* input, float* output, int size){ printf("copy: %d\n",threadIdx.x); }
    __device__ static void kernelOptimized(const float* input, float* output, int size){ printf("copy optimized: %d\n",threadIdx.x); }
};

struct difference
{
    std::string name() { return "difference"; }

    __device__ static void kernel(const float* input, float* output, int size){ printf("difference: %d\n",threadIdx.x); }
    __device__ static void kernelOptimized(const float* input, float* output, int size){ printf("difference optimized: %d\n",threadIdx.x); }
};

template <typename Experiment>
void do_test()
{
    dim3 dimBlock( 4, 1 );
    dim3 dimGrid( 1, 1 );
    Experiment e;

    std::cout << "running experiment " << e.name() << std::endl;
    std::cout << "launching the normal kernel" << std::endl;
    kernel_wrapper<Experiment::kernel><<<dimGrid, dimBlock>>>(0,0,0);
    cudaDeviceSynchronize();
    std::cout << "launching the optimized kernel" << std::endl;
    kernel_wrapper<Experiment::kernelOptimized><<<dimGrid, dimBlock>>>(0,0,0);
    cudaDeviceSynchronize();
}


int main()
{
    do_test<copy>();
    do_test<difference>();
    return 0;
}

Output:

running experiment copy
launching the normal kernel
copy: 0
copy: 1
copy: 2
copy: 3
launching the optimized kernel
copy optimized: 0
copy optimized: 1
copy optimized: 2
copy optimized: 3
running experiment difference
launching the normal kernel
difference: 0
difference: 1
difference: 2
difference: 3
launching the optimized kernel
difference optimized: 0
difference optimized: 1
difference optimized: 2
difference optimized: 3
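
The wrapper pattern above also addresses the maintenance concern raised in the question: once the experiment is a struct, adding a newly required function touches only the structs and do_test, not the call sites. The following host-only C++ sketch illustrates this under assumptions: there is no GPU code, kernel_wrapper is an ordinary function template standing in for the __global__ wrapper, and the struct names copy_experiment and difference_experiment as well as the check signature with a size parameter are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef void (*kernel_ptr)(const float* input, float* output, int size);

// Host-only stand-in for kernel_wrapper: forwards to the function that was
// passed as a non-type template parameter.
template <kernel_ptr kernel>
void kernel_wrapper(const float* input, float* output, int size)
{
    kernel(input, output, size);
}

struct copy_experiment {
    static std::string name() { return "copy"; }
    static void kernel(const float* input, float* output, int size)
    {
        for (int idx = 0; idx < size; ++idx) output[idx] = input[idx];
    }
    // Requirement added later: only the structs and do_test change.
    static bool check(const float* input, const float* output, int size)
    {
        for (int idx = 0; idx < size; ++idx)
            if (output[idx] != input[idx]) return false;
        return true;
    }
};

struct difference_experiment {
    static std::string name() { return "difference"; }
    // Assumption: boundary elements are left untouched.
    static void kernel(const float* input, float* output, int size)
    {
        for (int idx = 1; idx + 1 < size; ++idx)
            output[idx] = input[idx + 1] - input[idx - 1];
    }
    static bool check(const float* input, const float* output, int size)
    {
        for (int idx = 1; idx + 1 < size; ++idx)
            if (output[idx] != input[idx + 1] - input[idx - 1]) return false;
        return true;
    }
};

// Runs the experiment's kernel through the wrapper and checks the result.
template <typename Experiment>
bool do_test(const std::vector<float>& input)
{
    assert(!input.empty());
    std::vector<float> output(input.size(), 0.0f);
    const int size = static_cast<int>(input.size());
    kernel_wrapper<Experiment::kernel>(input.data(), output.data(), size);
    return Experiment::check(input.data(), output.data(), size);
}
```

A call such as do_test<copy_experiment>(input) stays the same no matter how many functions the experiment struct grows to contain.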

Alternatively, you can combine CRTP with template specialization:

#include <iostream>
#include <stdio.h>


template <typename Experiment>
__global__ void f();

template <typename Derived>
struct experiment
{
    void run()
    {
        int blocksize = static_cast<Derived*>(this)->blocksize();
        int reps = static_cast<Derived*>(this)->repetitions();
        for (int i = 0; i<reps; ++i)
        {
            dim3 dimBlock( blocksize, 1 );
            dim3 dimGrid( 1, 1 );
            f<Derived><<<dimGrid, dimBlock>>>();
        }
        cudaDeviceSynchronize();
    }
};

struct experiment1 : experiment<experiment1>
{
    int repetitions() { return 2; }
    int blocksize() { return 4; }
    experiment1() { std::cout << "setting up experiment 1" << std::endl; }
    ~experiment1() {  std::cout << "shutting down experiment 1" << std::endl;  }
};

template <>
__global__
void f<experiment1>()
{
    printf("experiment1: %d\n",threadIdx.x);
}


struct experiment2 : experiment<experiment2>
{
    int repetitions() { return 4; }
    int blocksize() { return 2; }
    experiment2() { std::cout << "setting up experiment 2" << std::endl; }
    ~experiment2() {  std::cout << "shutting down experiment 2" << std::endl;  }
};

template <>
__global__
void f<experiment2>()
{
    printf("experiment2: %d\n",threadIdx.x);
}

template<typename Experiment>
void do_test()
{
    Experiment e;
    e.run();
}

int main()
{
    do_test<experiment1>();
    do_test<experiment2>();
    return 0;
}

Output:

setting up experiment 1
experiment1: 0
experiment1: 1
experiment1: 2
experiment1: 3
experiment1: 0
experiment1: 1
experiment1: 2
experiment1: 3
shutting down experiment 1
setting up experiment 2
experiment2: 0
experiment2: 1
experiment2: 0
experiment2: 1
experiment2: 0
experiment2: 1
experiment2: 0
experiment2: 1
shutting down experiment 2

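【Comments】:

Thanks for the edit and the answer. I have one issue with this approach: each experiment calls the kernel itself. In my ideal setup, the do_test function would receive the kernel and run it.
@Spiros I added another approach.
I double-checked the device-function approach and found it entirely satisfactory for my purposes. The PTX generated by nvcc is identical for copy (defined as __global__) and for wrapper<copy> (where copy is a __device__ function). Inlining works perfectly here. Since __device__ functions can be member functions, even static member functions, I can do what I wanted. Thank you very much.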