Index array in CUDA Thrust: answers

【Question title】: Index array in CUDA thrust
【Posted】: 2018-03-30 11:27:55
【Question description】:

I am trying to figure out how to use an index array in CUDA Thrust. My problem is as follows:

vector<int> index(20);
vector<float> data1(100), data2(100), result(20);
for(int i=0;i<index.size();++i)
   result[i] = do_something(data1[index[i]],data2[index[i]]);

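The function do_something() takes elements from several large arrays, selected by the index array. The index array is usually much smaller than the data, and its elements are sorted.

I don't know what the best strategy is for doing this efficiently.

【Question comments】:

Please post a minimal reproducible example.
I can't post an example because I don't know how to do it. That is the essence of the problem.
This is a very basic Thrust usage question; I suggest reading the thrust quick start guide to learn the basics. The requested MCVE would help others help you: if you outlined a typical example of what do_something does (e.g. elementwise addition of data1 and data2), others could probably show you how to implement it with a Thrust algorithm.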

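Tags: cuda thrust

【Solution 1】:

I suggest reading the thrust quick start guide to learn the basics of Thrust, along with some of the concepts I use below.

"I don't know what the best strategy is for doing this efficiently."

Most of what is in your example should map directly onto simple Thrust operations. Your for loop and the do_something operation would be replaced by a Thrust algorithm, probably with an appropriate functor definition that mimics what do_something does. One easy-to-use Thrust algorithm that probably fits this case is thrust::transform().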
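
As a host-side analogue only (plain C++ rather than Thrust, so it runs without a GPU), the loop-to-algorithm mapping can be sketched with std::transform, whose two-input form thrust::transform mirrors. The names hypot_op and elementwise_hypot are hypothetical, and the root-of-sum-of-squares operation is only an assumed stand-in for do_something.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical stand-in for do_something: sqrt(a^2 + b^2).
struct hypot_op {
  float operator()(float a, float b) const { return std::sqrt(a * a + b * b); }
};

// Applies hypot_op to corresponding elements of a and b.
// Assumes a.size() == b.size().
std::vector<float> elementwise_hypot(const std::vector<float>& a,
                                     const std::vector<float>& b) {
  std::vector<float> out(a.size());
  // The explicit for loop becomes one algorithm call plus a functor.
  std::transform(a.begin(), a.end(), b.begin(), out.begin(), hypot_op{});
  return out;
}
```

The functor carries the per-element work, and the algorithm call takes over the iteration. This sketch does not yet include the indirection through the index array.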

Indirect addressing through an index array is usually handled with a Thrust permutation_iterator, which is designed for exactly this purpose.
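
Conceptually, a permutation iterator built from a data range and an index range yields data[index[i]] at position i. The following is a rough host-side sketch of that indirection in plain C++, not Thrust itself; the helper name gather_apply is hypothetical, and it assumes every index value is a valid position in both data arrays.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: result[i] = op(data1[index[i]], data2[index[i]]).
// Assumes every value in index is a valid position in data1 and data2.
template <typename Op>
std::vector<float> gather_apply(const std::vector<int>& index,
                                const std::vector<float>& data1,
                                const std::vector<float>& data2, Op op) {
  std::vector<float> result;
  result.reserve(index.size());
  for (std::size_t i = 0; i < index.size(); ++i) {
    // The indirection step: look up the data position through the index array.
    const std::size_t k = static_cast<std::size_t>(index[i]);
    result.push_back(op(data1[k], data2[k]));
  }
  return result;
}
```

The result has one element per index entry, in the order the indices appear, so only the selected elements of the large arrays are touched.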

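Combining these concepts, and assuming your do_something operation sums the squares of each indexed element from the two input vectors and stores the square root of that sum in the result vector, an example could look like this: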

$ cat t74.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/copy.h>
#include <math.h>
#include <iostream>
#include <iterator>

struct do_something
{
  template <typename T>
  __host__ __device__ T operator()(const T &i1, const T &i2){
    return sqrtf(i1*i1 + i2*i2);
  }
};


int main(){
  //pythagorean triples
  float d1[] = { 3,  5,  8,  7, 20, 12,  9, 28};
  float d2[] = { 4, 12, 15, 24, 21, 35, 40, 45};
  int i1[] = {1, 3, 5, 7};
  const size_t isize = sizeof(i1)/sizeof(i1[0]);
  const size_t dsize = sizeof(d1)/sizeof(d1[0]);
  thrust::device_vector<int> index(i1, i1+isize);
  thrust::device_vector<float> data1(d1, d1+dsize);
  thrust::device_vector<float> data2(d2, d2+dsize);
  thrust::device_vector<float> result(isize);
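  // Permutation iterators read data1[index[i]] and data2[index[i]];
  // index.end() marks the end of the first input range.
  thrust::transform(
      thrust::make_permutation_iterator(data1.begin(), index.begin()),
      thrust::make_permutation_iterator(data1.begin(), index.end()),
      thrust::make_permutation_iterator(data2.begin(), index.begin()),
      result.begin(), do_something());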
  thrust::copy_n(result.begin(), result.size(), std::ostream_iterator<float>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -arch=sm_30 -o t74 t74.cu
$ ./t74
13,25,37,53,
$

For this usage, the index set does not need to be sorted (although sorting may offer some performance benefit).

【Comments】: