Parallel computing of a sorted array's indices: answers

【Question title】: parallel computing about sorted array's index
【Posted】: 2016-05-19 08:21:55
【Question】:

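I have a problem computing, with CUDA, the first index of each member of a sorted array. For example, given the sorted array [1,1,2,2,5,5,5], I need to return 0 (the first index of 1), 2 (the first index of 2), and 4 (the first index of 5). Is there a parallel way to solve this?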

【Comments on the question】:

Tags: parallel-processing cuda

【Answer 1】:

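One possible approach is:

1. Use an adjacent difference method (each parallel thread looks at its own element and its neighbor) to identify the start of each subsequence. An element that does not differ from its neighbor is not the start of a subsequence. An element that differs from its neighbor marks the start (or the end, or both the start and the end) of a subsequence.
2. Once the start of each subsequence has been identified, use a stream compaction method to reduce the given sequence to one containing only the elements that mark the start of each subsequence. Stream compaction can also be done in parallel; a typical approach is to use a parallel prefix sum to compute the destination address of each element in the compacted sequence.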

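The first part of the algorithm above is fairly easy to write directly as CUDA code. The second part is somewhat more involved, because a parallel prefix sum is more complicated to write. Moreover, for algorithms such as parallel prefix sum, parallel reduction, and sorting, I would never recommend writing them from scratch. You should always use a library implementation if possible.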

Accordingly, the thrust library, which is built on top of CUDA, provides a set of routines that make it easy to prototype this kind of solution:

$ cat t1200.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>   // std::ostream_iterator

typedef int mytype;

using namespace thrust::placeholders;

int main(){

  mytype data[] = {1,1,2,2,5,5,5};
  int dsize = sizeof(data)/sizeof(data[0]);

  // copy the sorted input to the device
  thrust::device_vector<mytype> d_data(data, data+dsize);
  // step 1: d_diffs[i] = d_data[i] - d_data[i-1], and d_diffs[0] = d_data[0]
  thrust::device_vector<mytype> d_diffs(dsize);
  thrust::adjacent_difference(d_data.begin(), d_data.end(), d_diffs.begin());
  // step 2: stream compaction, keeping index i wherever d_diffs[i] > 0
  thrust::device_vector<int> d_result(dsize);
  int rsize = thrust::copy_if(thrust::counting_iterator<int>(0),
                              thrust::counting_iterator<int>(dsize),
                              d_diffs.begin(),    // stencil
                              d_result.begin(),
                              _1 > 0) - d_result.begin();
  // print the rsize indices that were kept
  thrust::copy_n(d_result.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}

$ nvcc -o t1200 t1200.cu
$ ./t1200
0,2,4,
$

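Depending on the exact composition of your input data, there may be various corner cases to handle. The code above is only a simple example demonstrating one possible method. For example, if the first element of the sorted sequence is zero or negative, the code needs a slight modification: adjacent_difference copies the first input element unchanged into the first element of d_diffs, so the `_1 > 0` test would reject index 0. Since the first element of the input is always the start of a subsequence, this is easily handled with one extra line of code that always sets the first element of d_diffs to a positive value before copy_if is called.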

【Comments】:

I am thinking about using a scan with the sorted data as keys. That may be faster than this.