Rank of each element in a matrix row using CUDA: answers

【Question title】: Rank of each element in a matrix row using CUDA
【Posted】: 2017-06-17 19:26:56
【Question】:

Is there any way to find the rank of each element within a matrix row, using CUDA or any function provided by NVIDIA?

【Comments】:

Could you describe your problem in more detail?
Problem details, for example: row elements = [4,1,7,1], ranks = [1,0,2,0]. Equal elements are given the same rank.

Tags: cuda pycuda

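【Solution 1】:

I'm not aware of a built-in ranking or argsort function in CUDA or in any of the libraries I'm familiar with.

You can certainly build such a function out of lower-level operations using Thrust.

Here is an (unoptimized) outline of a possible approach using Thrust: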

$ cat t84.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/functional.h>
#include <thrust/adjacent_difference.h>
#include <thrust/transform.h>
#include <thrust/iterator/permutation_iterator.h>
#include <iostream>
#include <iterator>

typedef int mytype;

struct clamp
{
  template <typename T>
  __host__ __device__
  T operator()(T data){
    if (data == 0) return 0;
    return 1;}
};

int main(){

  mytype data[]  = {4,1,7,1};
  int dsize = sizeof(data)/sizeof(data[0]);
  thrust::device_vector<mytype> d_data(data, data+dsize);
  thrust::device_vector<int> d_idx(dsize);
  thrust::device_vector<int> d_result(dsize);

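  // d_idx[i] = i: remembers each element's original position
  thrust::sequence(d_idx.begin(), d_idx.end());

  // sort the values ascending, carrying the original positions along
  thrust::sort_by_key(d_data.begin(), d_data.end(), d_idx.begin(), thrust::less<mytype>());
  thrust::device_vector<int> d_diff(dsize);
  // a nonzero difference marks a position where the sorted value changes
  thrust::adjacent_difference(d_data.begin(), d_data.end(), d_diff.begin());
  d_diff[0] = 0;  // adjacent_difference copies the first element; the smallest value gets rank 0
  // clamp the differences to 0/1 flags
  thrust::transform(d_diff.begin(), d_diff.end(), d_diff.begin(), clamp());
  // the prefix sum of the flags is the rank of each sorted element
  thrust::inclusive_scan(d_diff.begin(), d_diff.end(), d_diff.begin());

  // scatter the ranks back to the original positions
  thrust::copy(d_diff.begin(), d_diff.end(), thrust::make_permutation_iterator(d_result.begin(), d_idx.begin()));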
  thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
}

$ nvcc -arch=sm_61 -o t84 t84.cu
$ ./t84
1,0,2,0,
$

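【Comments】:

Thanks. Why is it unoptimized? If I'm not mistaken, your solution works on a vector. Since I want to perform the task above on the rows of a matrix, would your solution work in that case? And can I use it from PyCUDA?
It's unoptimized because I haven't considered all the different ways of building such a function, so I expect there are more efficient approaches. Even for the code shown, clever use of Thrust fusion could probably improve performance. The outlined approach is meant as a conceptual sketch of how a row-ranking function could be implemented. If you want to extend it to work on all rows of a matrix at once, I expect that can be done, since Thrust operations can be extended this way (take a look at the Thrust examples). Regarding PyCUDA, if you google "thrust pycuda" you will find interoperability examples.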
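
The row-wise extension mentioned in the comment above could look roughly like the following. This is only a sketch, not code from the answer: it assumes a row-major matrix stored in one flat array, and the names row_of, new_value_flag, d_row, d_flag and the 2x4 test matrix are made up for illustration. It sorts by the pair (row, value), flags the positions where the value changes inside a row, and uses a segmented scan so that the ranks restart at every row.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <cstdio>

// Maps a flat (row-major) index to its row index.
struct row_of {
  int cols;
  row_of(int c) : cols(c) {}
  __host__ __device__ int operator()(int i) const { return i / cols; }
};

// 1 where a sorted element starts a new distinct value inside its row, else 0.
struct new_value_flag {
  const int* v;
  int cols;
  new_value_flag(const int* p, int c) : v(p), cols(c) {}
  __host__ __device__ int operator()(int i) const {
    return (i % cols != 0 && v[i] != v[i - 1]) ? 1 : 0;
  }
};

int main() {
  const int rows = 2, cols = 4, n = rows * cols;
  int h_data[n] = {4, 1, 7, 1,   3, 3, 2, 9};  // hypothetical test matrix

  thrust::device_vector<int> d_val(h_data, h_data + n);
  thrust::device_vector<int> d_idx(n), d_row(n), d_flag(n), d_rank(n), d_result(n);

  // d_idx[i] = i (original flat position), d_row[i] = row of element i
  thrust::sequence(d_idx.begin(), d_idx.end());
  thrust::transform(d_idx.begin(), d_idx.end(), d_row.begin(), row_of(cols));

  // Sort by (row, value); the original flat index rides along as the payload.
  // Every row keeps exactly `cols` elements, so d_row is unchanged by the sort.
  thrust::sort_by_key(
      thrust::make_zip_iterator(thrust::make_tuple(d_row.begin(), d_val.begin())),
      thrust::make_zip_iterator(thrust::make_tuple(d_row.end(), d_val.end())),
      d_idx.begin());

  // Flag value changes within each row of the sorted data.
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(n),
                    d_flag.begin(),
                    new_value_flag(thrust::raw_pointer_cast(d_val.data()), cols));

  // Segmented prefix sum of the flags: the running count restarts in every row.
  thrust::inclusive_scan_by_key(d_row.begin(), d_row.end(),
                                d_flag.begin(), d_rank.begin());

  // Put each rank back at the element's original position.
  thrust::scatter(d_rank.begin(), d_rank.end(), d_idx.begin(), d_result.begin());

  thrust::host_vector<int> h_result = d_result;
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      printf("%d%c", (int)h_result[r * cols + c], c + 1 < cols ? ' ' : '\n');
  return 0;
}
```

For the two rows [4,1,7,1] and [3,3,2,9] this should print 1 0 2 0 and 1 1 0 2. Sorting on a zipped (row, value) key is the simplest way to write it; two stable sorts on plain integer keys would be an alternative if the comparison-based sort turns out to be a bottleneck.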

【Solution 2】:

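In CUDA, the concept of a rank is not the same as in other frameworks such as OpenMP or MPI. In this case you will need to write a __global__ function (a kernel) and work with the threadIdx.x and blockIdx.x parameters.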
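
As a rough illustration of how blockIdx.x and threadIdx.x could be applied to the original question, here is a sketch of a hand-written kernel. It is not code from this answer: the kernel name dense_rank_rows, the one-block-per-row mapping, the block size of 128 and the test matrix are all assumptions made for the example. Each thread computes the rank of its element as the number of distinct smaller values in the same row, which costs O(cols) work per element and needs cols * sizeof(int) bytes of shared memory per block, so it only suits rows of moderate length.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One block per row (blockIdx.x); the threads of the block stride over the columns.
__global__ void dense_rank_rows(const int* in, int* out, int cols)
{
  // first[k] == 1 if element k is the first occurrence of its value in this row
  extern __shared__ int first[];
  const int* row = in + (size_t)blockIdx.x * cols;
  int* row_out = out + (size_t)blockIdx.x * cols;

  // Phase 1: mark the first occurrence of every distinct value.
  for (int j = threadIdx.x; j < cols; j += blockDim.x) {
    int is_first = 1;
    for (int k = 0; k < j; ++k)
      if (row[k] == row[j]) { is_first = 0; break; }
    first[j] = is_first;
  }
  __syncthreads();

  // Phase 2: rank = number of distinct values strictly smaller than this one.
  for (int j = threadIdx.x; j < cols; j += blockDim.x) {
    int r = 0;
    for (int k = 0; k < cols; ++k)
      if (first[k] && row[k] < row[j]) ++r;
    row_out[j] = r;
  }
}

int main()
{
  const int rows = 2, cols = 4, n = rows * cols;
  int h_in[n] = {4, 1, 7, 1,   3, 3, 2, 9};  // hypothetical test matrix
  int h_out[n];

  int *d_in = 0, *d_out = 0;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

  // rows blocks, 128 threads each, one int of dynamic shared memory per column
  dense_rank_rows<<<rows, 128, cols * sizeof(int)>>>(d_in, d_out, cols);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      printf("%d%c", h_out[r * cols + c], c + 1 < cols ? ' ' : '\n');

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

With the rows [4,1,7,1] and [3,3,2,9] this should print 1 0 2 0 and 1 1 0 2, matching the ranking convention given in the question's comments.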

【Comments】: