CUDA force instruction execution order (answers)

【Question title】: CUDA force instruction execution order
【Posted】: 2022-01-02 04:59:20
【Question】:

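I am trying to move some data manipulation from the CPU to the GPU (CUDA), but a small part of it requires instructions to run in a specific order. In principle I could run the first few parallel parts on the GPU, transfer the results to the host for the serial part, and then transfer them back for the remaining parallel parts, but I am trying to avoid the memory transfer overhead.

The serial part of the computation has the form: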

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + result[i-1];
}

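Other than launching a kernel on a single thread to do this computation, is there any way to force threads or computations to run in a specific order?

Edit:

The problem is a bit more complicated than I first showed, and as far as I can tell it cannot be treated as a prefix sum problem.

The loop actually looks like this: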

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + k * result[i-1];
}

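I have been looking through the documentation for the Thrust library, but it does not seem to offer a solution. However, I may simply not understand what I am looking at. Is there a parallel solution to this kind of problem?

【Question discussion】:

Tags: cuda thrust

【Answer 1】:

One way to characterize problems of this kind is to place them in the category of recurrence relations.

The original problem: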

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + result[i-1];
}

can be solved easily with a prefix sum over oldArray, if necessary following the description given in this question/answer.

With the modification from the edit:
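To make the prefix-sum observation concrete, here is a minimal host-only sketch in standard C++17 (no GPU involved). The function names recurrence_serial and recurrence_by_scan are my own, and the sketch assumes, as in the question's loop, that result[0] and result[1] are already set and the recurrence starts at i = 2.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Serial reference: the loop from the question (starts at i = 2).
template <typename T>
void recurrence_serial(std::vector<T>& result, const std::vector<T>& oldArray) {
    assert(result.size() == oldArray.size());
    for (std::size_t i = 2; i < result.size(); i++) {
        result[i] = oldArray[i] + result[i - 1];
    }
}

// Same computation as an inclusive scan over oldArray[2..], seeded with result[1]:
// result[i] = result[1] + oldArray[2] + ... + oldArray[i].
template <typename T>
void recurrence_by_scan(std::vector<T>& result, const std::vector<T>& oldArray) {
    assert(result.size() == oldArray.size());
    if (result.size() < 3) return;  // nothing to compute
    std::inclusive_scan(oldArray.begin() + 2, oldArray.end(),
                        result.begin() + 2, std::plus<>(), result[1]);
}
```

Because the scan only relies on addition being associative, a parallel scan on the GPU can compute the same values (up to floating-point rounding differences when T is a floating-point type).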

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + k * result[i-1];
}

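we have to do additional work. Referring to the previously linked answer, at the bottom of that answer there is a reference to this paper by Blelloch. If we study section 1.4 of that paper, we can observe that this new problem formulation fits the "first-order recurrence" pattern described in section 1.4.1, specifically formula 1.5. A proof is given there of how a scan operation can be used to realize a solution to that formulation, provided we carefully specify the input/output data as well as the scan operator.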
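As a rough summary of how I read that formulation, specialized to this problem: the paper's two generic operators become ordinary multiplication and addition, b plays the role of oldArray, x plays the role of result, and every a_i equals k. The symbols follow the answer's references to the paper; treat the exact notation as my paraphrase rather than a quotation.

```latex
\begin{align*}
x_i &= \begin{cases} b_0 & i = 0 \\ a_i\,x_{i-1} + b_i & 0 < i < n \end{cases}
      && \text{(the recurrence, cf. 1.5)} \\
c_i &= [\,a_i,\ b_i\,]
      && \text{(input pairs)} \\
c_i \bullet c_j &= [\,c_{i,a}\,c_{j,a},\ \ c_{i,b}\,c_{j,a} + c_{j,b}\,]
      && \text{(scan operator, cf. 1.7)} \\
s_i &= c_0 \bullet c_1 \bullet \cdots \bullet c_i = [\,y_i,\ x_i\,],
      \qquad y_i = a_0 a_1 \cdots a_i
      && \text{(scan output)}
\end{align*}

% Sanity check for i = 1:
%   s_1 = c_0 \bullet c_1 = [a_0 a_1,\ b_0 a_1 + b_1], so x_1 = a_1 b_0 + b_1 = a_1 x_0 + b_1.
% Associativity, which is what allows a parallel scan:
%   (c_1 \bullet c_2) \bullet c_3 = [a_1 a_2 a_3,\ (b_1 a_2 + b_2) a_3 + b_3]
%   c_1 \bullet (c_2 \bullet c_3) = [a_1 a_2 a_3,\ b_1 a_2 a_3 + b_2 a_3 + b_3]
```

The second component of each scan output is the value we want; the first component is only carried along so that partial results can be combined in any grouping.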

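Thrust supports this kind of generalization of the basic scans it provides. The set of pairs referred to as s and c in the paper can be realized as thrust::tuple, and a specific operator can be passed to the thrust scan operation to generalize its behavior.

I won't attempt to cover everything in that paper; we mostly only need the material presented on pages 48 and 49.

What follows is an example using thrust, demonstrating that we can use a thrust scan operation exactly as described in the paper to solve this problem formulation. The code below is annotated with comments that refer to specific formulas in Blelloch's paper: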
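Before involving the GPU, the pair-plus-operator idea can be tried on the host. The following is a minimal sketch in standard C++17, using std::pair in place of thrust::tuple and std::inclusive_scan in place of the thrust scan; the names Pair, combine and solve_by_pair_scan are hypothetical and chosen only for this illustration. The combine function is meant to mirror the scan_op functor in the thrust listing.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

using Pair = std::pair<double, double>;  // (a, b): multiplier and addend

// Combine two pairs: [a1, b1] . [a2, b2] = [a1*a2, b1*a2 + b2].
inline Pair combine(const Pair& l, const Pair& r) {
    return {l.first * r.first, l.second * r.first + r.second};
}

// Solve x[0] = b[0], x[i] = b[i] + k * x[i-1] with an inclusive scan over pairs.
inline std::vector<double> solve_by_pair_scan(const std::vector<double>& b, double k) {
    std::vector<Pair> c(b.size());
    for (std::size_t i = 0; i < b.size(); i++) c[i] = {k, b[i]};

    std::vector<Pair> s(c.size());
    std::inclusive_scan(c.begin(), c.end(), s.begin(), combine);

    // The second component of each scanned pair is the recurrence value.
    std::vector<double> x(b.size());
    for (std::size_t i = 0; i < b.size(); i++) x[i] = s[i].second;
    return x;
}
```

For example, with b = {1, 2, 3, 4} and k = 2 the serial loop gives 1, 4, 11, 26, and solve_by_pair_scan returns the same values.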

$ cat t1929.cu
#include <iostream>
#include <iterator>   // std::ostream_iterator
#include <cstdlib>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>

template <typename T>
void cpufunction(T *result, T *oldArray, size_t size, T k){
  for (int i = 1; i<size; i++)
  {
    result[i] = oldArray[i] + k * result[i-1];
  }
}

struct scan_op // as per blelloch (1.7)
{
  template <typename T1, typename T2>
  __host__ __device__
  T1 operator()(const T1 &t1, const T2 &t2){
    T1 ret;
    thrust::get<0>(ret) = thrust::get<0>(t1)*thrust::get<0>(t2);
    thrust::get<1>(ret) = thrust::get<1>(t1)*thrust::get<0>(t2)+thrust::get<1>(t2);
    return ret;
  }
};

typedef float mt;
const size_t ds = 1048576;
const mt k = 1.01;
int main(){

  mt *b  = new mt[ds]; // b as in blelloch (1.5)
  mt *a  = new mt[ds]; // a as in blelloch (1.5)
  mt *cr = new mt[ds]; // cpu result
  for (int i = 0; i < ds; i++) { a[i] = k; b[i] = rand()/(float)RAND_MAX;}
  cr[0] = b[0];
  cpufunction(cr, b, ds, k);
  for (int i = 0; i < 10; i++) std::cout << cr[i] << ",";
  std::cout << std::endl;
  thrust::device_vector<mt> db(b, b+ds);
  thrust::device_vector<mt> da(a, a+ds);
  thrust::device_vector<mt> dy(ds);
  thrust::device_vector<mt> dx(ds);
  thrust::inclusive_scan(thrust::make_zip_iterator(thrust::make_tuple(da.begin(), db.begin())), thrust::make_zip_iterator(thrust::make_tuple(da.end(), db.end())), thrust::make_zip_iterator(thrust::make_tuple(dy.begin(), dx.begin())), scan_op());
  thrust::host_vector<mt> hx = dx;
  thrust::copy_n(hx.begin(), 10, std::ostream_iterator<mt>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -std=c++14 t1929.cu -o t1929
$ ./t1929
0.840188,1.24297,2.0385,2.85733,3.79755,4.03307,4.40863,5.22094,5.55093,6.16041,
0.840188,1.24297,2.0385,2.85733,3.79755,4.03307,4.40863,5.22094,5.55093,6.16041,

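The first-order recurrence described by Blelloch allows for a more or less arbitrary a array. In this problem the a array is simply k, k, k, ... We could simplify further by eliminating the a array and replacing it with a thrust::constant_iterator. That exercise is fairly mechanical and is left to the reader.

【Discussion】: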

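You may not need to use thrust::device_vector at all. Thrust operations can also work on "ordinary" device data. There are two ways to do this: either as "raw" data, or via pointers converted to thrust::device_ptr. There are questions and examples under the thrust tag showing how to do both. For use with zip_iterator, thrust::device_ptr should be fine. You can convert an ordinary "raw" pointer to a thrust::device_ptr
I haven't done any careful testing. When I run the code for the last case, the one I posted in the second question you posted (thrust kernel only, ds=256000, V100 GPU, float data), I get about 80 microseconds. That seems fairly fast to me, given that a kernel launch takes about 5 microseconds. It is possible (I suppose) that the serial code is faster than 80 microseconds. I haven't timed it.
When I run the last case I posted in the other answer, with ds of 256k and double, on a Maxwell GTX 960 GPU (smaller/weaker than your Pascal P22200), and time only the call to thrust::inclusive_scan, I get 972 microseconds for cpufunction and 967 microseconds for thrust::inclusive_scan. On the GPU side, I'm concerned that you are timing a debug project. If that is the case, you should not do that. And yes, if you add in the data transfer time and the cudaMalloc time, the GPU starts to look worse.
I wouldn't expect the GPU to beat the CPU on just a single function. You started the discussion with "but a small part of it requires instructions to run in a specific order". If this is a small part of a larger workflow, then reaching parity here may be a useful step. But if you expect this operation by itself to be much faster than the CPU while you are paying the overhead of all the data movement, then I would say you are probably wasting your time.
When I switch to float, the timings I reported above become: CPU: 315us, GPU: 183us.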
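For anyone who wants to reproduce the CPU side of these timings, a minimal host-only sketch might look like the following. It is not the code used for the numbers above; the names cpu_recurrence and time_cpu_recurrence_us are hypothetical, and taking the best of several repetitions is my own choice. Build it with optimizations enabled, since timing a debug build is not meaningful.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Serial recurrence, same form as cpufunction in the answer.
inline void cpu_recurrence(std::vector<float>& result,
                           const std::vector<float>& oldArray, float k) {
    assert(result.size() == oldArray.size());
    for (std::size_t i = 1; i < result.size(); i++) {
        result[i] = oldArray[i] + k * result[i - 1];
    }
}

// Run the serial loop `reps` times; return the best wall-clock time in microseconds.
// Returns -1.0 if reps <= 0.
inline double time_cpu_recurrence_us(std::vector<float>& result,
                                     const std::vector<float>& oldArray,
                                     float k, int reps) {
    using clock = std::chrono::steady_clock;
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        auto t0 = clock::now();
        cpu_recurrence(result, oldArray, k);
        auto t1 = clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (best < 0.0 || us < best) best = us;
    }
    return best;
}
```

result[0] must be set by the caller before timing, as in the answer's listing where cr[0] = b[0] is assigned before cpufunction is called.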