Equivalent of Matlab's find command in CUDA (answers)

【Question title】: Equivalent of Matlab's find command in CUDA
【Posted】: 2012-08-20 13:04:41
【Question】:

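I am trying to find the nonzero elements of a matrix as fast as possible. Looking into CUDA\Jacket, I learned that this is much slower than Matlab's "regular" CPU version of find, probably because of memory allocation issues, since the size of the output is not known before find runs. However, bwlabel and regionprops (both supported by Jacket) do yield information about the nonzero elements efficiently, and they are much faster than the built-in functions of Matlab's Image Processing Toolbox. Is there a way to use them to obtain the nonzero elements? And is there a way to do some processing on each of the labeled objects found with bwlabel?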
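The concern about the unknown output size can be made concrete with the two-pass pattern that parallel find routines commonly follow: first count and scan, then allocate once and scatter. The following is only a host-side C++ sketch of that general pattern. It is not Jacket's implementation, which this thread does not describe, and `find_nonzero` is a hypothetical name chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the 0-based linear indices of the nonzero entries of `data`.
// Hypothetical helper: illustrates the count/scan/scatter pattern only.
std::vector<std::size_t> find_nonzero(const std::vector<float>& data)
{
    const std::size_t n = data.size();

    // Pass 1: exclusive prefix sum over the 0/1 "is nonzero" flags.
    // offset[i] is the number of nonzeros located before element i.
    std::vector<std::size_t> offset(n + 1, 0);
    for (std::size_t i = 0; i < n; ++i)
        offset[i + 1] = offset[i] + (data[i] != 0.0f ? 1 : 0);

    // The last scan entry is the total count, so the output can be
    // allocated exactly once, after the size has become known.
    std::vector<std::size_t> indices(offset[n]);

    // Pass 2: every selected element writes its index to its own slot.
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] != 0.0f)
            indices[offset[i]] = i;

    assert(indices.size() == offset[n]);
    return indices;
}
```

On a GPU both loops would run in parallel (a scan followed by a scatter). The extra pass and the allocation are a fixed overhead, which is one plausible reason small inputs see no benefit.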

【Comments】:

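@nate, can you post some code showing what you are doing and how you are benchmarking it? find is one of the faster functions in Jacket, and you should not be having any trouble with it. Also mention whether you are using sparse matrices.
@pavan, please see my reply to gpu below. Jacket does fine, as long as you feed it matrices that are big enough. I somehow forgot that...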

Tags: matlab cuda find jacket

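【Answer 1】:

In my experience, the FIND implementation in Jacket is very fast, at least for matrices larger than roughly 300x300. I tested this on my laptop and share the results below. My hardware specs are: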

>> ginfo
Jacket v2.2 (build 77be88c) by AccelerEyes (64-bit Windows)
License: Standalone (C:\Program Files\AccelerEyes\Jacket\2.2\engine\jlicense.dat)
Addons: MGL16, JMC, SDK, DLA, SLA
CUDA toolkit 4.2, driver 4.2 (296.10)
GPU1 GeForce GT 540M, 2048 MB, Compute 2.1 (single,double)
Display Device: GPU1 GeForce GT 540M
Memory Usage: 1697 MB free (2048 MB total)

The CPU is an Intel Core i7-2630QM.

With Jacket I see roughly a 3x speedup over the CPU for the FIND function. Here is the benchmark code I used:

% time Jacket vs CPU
for n = 5:12;
    x(n) = 2^n;
    Ac = single(rand(x(n)));
    Ag = gsingle(Ac);
    t_cpu(n) = timeit(@() find(Ac > 0.5));
    t_gpu(n) = timeit(@() find(Ag > 0.5));
end

% plot results
plot(x, t_cpu ./ t_gpu);
xlabel('Matrix Edge Size', 'FontSize', 14);
ylabel('Jacket (GPU) Speedup over CPU', 'FontSize', 14);

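Here are the results of running this code:

I am sure the BWLABEL and REGIONPROPS functions supported by Jacket are also very fast, but based on the benchmark above you can probably just use FIND by itself.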

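【Comments】:

Embarrassingly, I had once again fallen for the silly idea of feeding the GPU small signal blocks of at most 50x50, thinking that this would work better. Repeatedly computing '[idx idy]=find(gdouble(m)>threshold);', where m is a 10x10 to 50x50 signal matrix obtained serially, and timing it with "timeit", each find iteration was 5-20 times slower... In contrast, accumulating a larger matrix does show the factor of 3 improvement (with a single gdouble call). Sorry for my silliness, and thanks again...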
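The batching idea in this comment, packing many small blocks into one large array and running a single find over it, needs one extra step: mapping each global hit back to its block. The following is a minimal host-side C++ sketch of that bookkeeping, assuming all blocks share the same size and are stored column-major one after another. The names `BlockHit` and `find_in_blocks` are hypothetical, and indices are 0-based.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BlockHit {
    std::size_t block; // which small matrix the hit belongs to
    std::size_t row;   // 0-based row inside that matrix
    std::size_t col;   // 0-based column inside that matrix
};

// `packed` holds equally sized rows x cols blocks back to back,
// each stored column-major (an assumption of this sketch).
std::vector<BlockHit> find_in_blocks(const std::vector<double>& packed,
                                     std::size_t rows, std::size_t cols,
                                     double threshold)
{
    const std::size_t blockSize = rows * cols;
    assert(blockSize > 0 && packed.size() % blockSize == 0);

    std::vector<BlockHit> hits;
    for (std::size_t g = 0; g < packed.size(); ++g) {
        if (packed[g] > threshold) {
            // Split the global index into a block number and a local index,
            // then split the local index into column-major subscripts.
            const std::size_t local = g % blockSize;
            hits.push_back({g / blockSize, local % rows, local / rows});
        }
    }
    return hits;
}
```

Blocks of varying size, such as the 10x10 to 50x50 range mentioned above, would need a per-block offset table instead of the fixed `blockSize` division.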

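【Answer 2】:

Besides what arrayfire discussed, another possibility is to use CUDA Thrust. Below I post a simple example in which particles in a two-dimensional domain are selected according to an (x, y) threshold. It can easily be adapted to the case the poster is interested in or, more generally, to emulate the behavior of Matlab's find.

Code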

#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/host_vector.h>   // thrust::host_vector is the container used below
#include <cstdio>                 // printf

struct isWithinThreshold {

    double2 thresholdPoint;

    __host__ __device__ isWithinThreshold(double2 thresholdPoint_) { thresholdPoint = thresholdPoint_; }

    // const-qualified so the predicate can be invoked on a const functor
    __host__ __device__ bool operator()(const double2 r) const {
        return ((r.x > thresholdPoint.x) && (r.y > thresholdPoint.y));
    }
};

/********/
/* MAIN */
/********/
int main()
{
    const int N = 5;
    double2 thresholdPoint = make_double2(0.5, 0.5);

    thrust::host_vector<double2> particleCoordinates(N);
    particleCoordinates[0].x = 0.45;    particleCoordinates[0].y = 0.4;
    particleCoordinates[1].x = 0.1;     particleCoordinates[1].y = 0.3;
    particleCoordinates[2].x = 0.8;     particleCoordinates[2].y = 0.9;
    particleCoordinates[3].x = 0.7;     particleCoordinates[3].y = 0.9;
    particleCoordinates[4].x = 0.7;     particleCoordinates[4].y = 0.45;

    // --- Find out the indices
    thrust::host_vector<int> indices(N);
    thrust::host_vector<int>::iterator end = thrust::copy_if(thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(N),
        particleCoordinates.begin(),
        indices.begin(),
        isWithinThreshold(thresholdPoint));
    int size = end - indices.begin();
    indices.resize(size);

    // --- Fetch values corresponding to the selected indices
    thrust::host_vector<double2> values(size);
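    // Both permutation iterators share the same element base (begin());
    // only the index iterator advances from indices.begin() to indices.end().
    thrust::copy(thrust::make_permutation_iterator(particleCoordinates.begin(), indices.begin()),
        thrust::make_permutation_iterator(particleCoordinates.begin(), indices.end()),
        values.begin());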

    for (int k = 0; k < size; k++)
        printf("k = %d; index = %d; value.x = %f; value.y = %f\n", k, indices[k], values[k].x, values[k].y);

    return 0;
}
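To adapt the selection above to something closer to Matlab's `[row, col] = find(A > t)`, the selected linear indices have to be converted to 1-based subscripts of a column-major matrix. The following host-only sketch uses the C++ standard library instead of Thrust so that it compiles without the CUDA toolkit. `std::iota` stands in for the counting iterator and the lambda for the stencil predicate. `FindResult` and `find_above` are hypothetical names, not part of Thrust or Jacket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

struct FindResult {
    std::vector<std::size_t> row; // 1-based row subscripts, as in Matlab
    std::vector<std::size_t> col; // 1-based column subscripts, as in Matlab
};

// `a` is a matrix with `rows` rows stored column-major, as Matlab stores it.
FindResult find_above(const std::vector<float>& a, std::size_t rows,
                      float threshold)
{
    assert(rows > 0 && a.size() % rows == 0);

    // Candidate linear indices 0..n-1 (the role of the counting iterator).
    std::vector<std::size_t> all(a.size());
    std::iota(all.begin(), all.end(), std::size_t{0});

    // Keep the indices whose element passes the test (the role of the stencil).
    std::vector<std::size_t> linear;
    std::copy_if(all.begin(), all.end(), std::back_inserter(linear),
                 [&](std::size_t i) { return a[i] > threshold; });

    // Convert 0-based linear indices to 1-based (row, col) subscripts.
    FindResult out;
    for (std::size_t i : linear) {
        out.row.push_back(i % rows + 1);
        out.col.push_back(i / rows + 1);
    }
    return out;
}
```

Because the indices are visited in increasing linear order, the hits come out column by column, the same ordering Matlab's find uses.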
