First time executing sort from Thrust takes too long

【Question title】: First time executing sort from thrust it takes too long
【Posted】: 2021-06-08 19:12:48
【Question description】:

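I am developing a fluid simulator with OpenGL (implementing the SPH algorithm). I have tried several approaches to run my simulator: first an octree, then a hash map, and now I am trying Z-order, for which I need to sort my particles by their index.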

What I find hard to understand is that one thrust::sort takes 15 ms, but two thrust::sort calls take 17 ms.

To be clearer: my simulator is built on OpenGL (all my buffers are created with OpenGL), and I use CUDA interop so that CUDA can sort those buffers.

This is the part where I get the buffers and "link" them to CUDA:

//I use this if to do the registerBuffer only one time
if (first == 0) {
        //index
        IBuffer* bIndex = RESOURCEMANAGER->getBuffer("particleLib::Index");
        int buffIdIndex = bIndex->getPropi(IBuffer::ID);
        //Position
        IBuffer* bPosition = RESOURCEMANAGER->getBuffer("particleLib::Position");
        int buffIdPosition = bPosition->getPropi(IBuffer::ID);
        //TempIndex
        IBuffer* bTempIndex = RESOURCEMANAGER->getBuffer("particleLib::TempIndex");
        int buffIdTempIndex = bTempIndex->getPropi(IBuffer::ID);
        //Velocity
        IBuffer* bVelocity = RESOURCEMANAGER->getBuffer("particleLib::Velocity");
        int buffIdVelocity = bVelocity->getPropi(IBuffer::ID);

        // register this buffer object with CUDA
        //This should only be called once
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Index, buffIdIndex, cudaGraphicsMapFlagsNone);
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_TempIndex, buffIdTempIndex, cudaGraphicsMapFlagsNone);
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Position, buffIdPosition, cudaGraphicsMapFlagsNone);
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Velocity, buffIdVelocity, cudaGraphicsMapFlagsNone);
        first = 1;
    }
    

    // map OpenGL buffer object for writing from CUDA
    int* dptrssboIndex;
    int* dptrssboTempIndex;
    float4 * dptrssboPosition;
    float4 * dptrssboVelocity;

    cudaGraphicsMapResources(1, &cuda_ssbo_Index, 0);
    cudaGraphicsMapResources(1, &cuda_ssbo_TempIndex, 0);
    cudaGraphicsMapResources(1, &cuda_ssbo_Position, 0);
    cudaGraphicsMapResources(1, &cuda_ssbo_Velocity, 0);

    size_t num_bytesssbo_Index;
    size_t num_bytesssbo_TempIndex;
    size_t num_bytesssbo_Position;
    size_t num_bytesssbo_Velocity;

    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboIndex, &num_bytesssbo_Index, cuda_ssbo_Index);
    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboTempIndex, &num_bytesssbo_TempIndex, cuda_ssbo_TempIndex);
    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboPosition, &num_bytesssbo_Position, cuda_ssbo_Position);
    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboVelocity, &num_bytesssbo_Velocity, cuda_ssbo_Velocity);

    mysort(&dptrssboIndex,&dptrssboPosition, &dptrssboTempIndex, &dptrssboVelocity,216000);

    cudaGraphicsUnmapResources(1, &cuda_ssbo_Index, 0);
    cudaGraphicsUnmapResources(1, &cuda_ssbo_TempIndex, 0);
    cudaGraphicsUnmapResources(1, &cuda_ssbo_Position, 0);
    cudaGraphicsUnmapResources(1, &cuda_ssbo_Velocity, 0);

This is the code of mysort:

void mysort(int ** index1, float4 ** values1, int** index2, float4 ** values2,int particles){
    
    thrust::device_ptr<int> i1buff = thrust::device_pointer_cast(*(index1));
    thrust::device_ptr<float4> v1buff = thrust::device_pointer_cast(*(values1));
    thrust::device_ptr<int> i2buff = thrust::device_pointer_cast(*(index2));
    thrust::device_ptr<float4> v2buff = thrust::device_pointer_cast(*(values2));

    //sorts
    thrust::stable_sort_by_key(i1buff, i1buff + particles,v1buff); // 15 ms
    //cudaThreadSynchronize();
    thrust::stable_sort_by_key(i2buff, i2buff + particles, v2buff); // 17 ms


    //repeated only to see the time
    thrust::stable_sort_by_key(i1buff, i1buff + particles, v1buff);
    //cudaThreadSynchronize();
    thrust::stable_sort_by_key(i2buff, i2buff + particles, v2buff); //4 sorts -> 19 ms

    //cudaThreadSynchronize();
}

Can someone explain what is happening?

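Edit 1: I used cudaDeviceSynchronize() to measure the time each sort takes (as @Jérôme-Richard suggested), and the first sort always takes longer, even if I change the order. Another fact is that the first sort takes longer when my camera is closer to the scene, which suggests that CUDA may be waiting for OpenGL to finish its work, making the first sort "take longer". I also tried removing every sort from my mysort() function so that the only thing inside it was cudaDeviceSynchronize(), and it still took 15 ms, which again suggests that CUDA may be waiting for OpenGL to finish the work from the last frame.

Edit 2: I did some more debugging, and this seems to be true. The real slowdown comes from the cudaGraphicsMapResources call. According to this (cudaGraphicsMapResources slow speed when mapping DirectX texture):

This function provides the synchronization guarantee that any graphics calls issued before cudaGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins.

So yes, it is waiting for OpenGL to draw something, because the camera distance affects the time cudaGraphicsMapResources takes.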

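【Question comments】:

I am not sure myself, but I would guess that the time of the first sort call includes other operations, such as memory allocation and transfers? How exactly are you timing your calls?
Several things could be happening. I assume all the values going into mysort are already in main memory, so subsequent calls to the function with more or less the same values will find them still in the CPU cache (which is faster than memory), giving you some extra performance. If you really want to measure performance against the number of calls, use a number of iterations, e.g. 1000, 2000, ... and then normalize. However, if you are profiling, the performance of the function across subsequent calls matters less than its performance relative to the whole program.
Hi @Beko, the sort should not allocate memory, because the buffers are already on the GPU. I know the times because I am using a custom engine called Nau3d, and its profiler tells me the time each "step" (pass) takes. The times should be correct, because I can see the impact of the sort on the fps. When I do 1 sort my fps count is affected, and when I do 2 sorts the fps stays almost the same.
OK. It is just that I had a similar experience before with nvprof, where the first cudaMemcpy after a kernel call took much longer than the later ones. That was because cudaMemcpy synchronizes implicitly, so the time reported for cudaMemcpy actually included the time the kernel took to finish. Maybe you could still try to measure the time in a more "traditional" way, i.e. with cudaEvents or something similar (I don't know whether that works with Thrust), and see whether the results differ. Unfortunately, beyond that I have no idea.
@Beko I think you are right. I think CUDA is waiting for OpenGL to finish its work, and that is why the first sort takes longer. The closer my camera is to the scene, the longer the first sort takes (the closer I am, the more work OpenGL has to do). This suggests that CUDA may be waiting for OpenGL to finish.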

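Tags: c++ opengl cuda thrust

【Solution 1】:

Two points could explain your observation:

The first CUDA function call implicitly initializes the runtime (which is slow).
The actual content of the arrays being sorted can, and often does, affect the performance of the sort (depending on the algorithm used in the Thrust implementation). Once the data is sorted, it can be sorted again faster because it is already in order.
In addition, Thrust synchronizes in many of the functions it provides (i.e. it calls cudaDeviceSynchronize) to ensure that results returned from the GPU can be read safely on the CPU side. It also uses this kind of synchronization internally when it submits several CUDA kernels that depend on each other's results (you can see this with the Nvidia profiler). With respect to asynchronous CUDA calls made before such a function, this over-synchronization adds unwanted overhead.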

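【Comments】:

Hi @Jérôme Richard, the first two buffers are not the same, so neither of them is sorted, and the time increase is 2 ms. Maybe it is the runtime initialization, as you said. Do you know whether there is a way to do it only once? My sort runs every frame.
You can call cudaThreadSynchronize in your main, before any computation, to trigger the CUDA initialization. I assume you measure the time with the (uncommented) cudaThreadSynchronize calls between the thrust::stable_sort_by_key calls and include them in your timings. Is that the case?
I measure the time with the profiler built into the engine I am using. The engine measures the time of each "step" (pass) of the simulation. I will talk to my teacher (the person who created the engine) and ask him whether the times are correct.
Sorry, I got confused. It is cudaDeviceSynchronize, not cudaThreadSynchronize. You can measure the time yourself with std::chrono::steady_clock::now();. Note that the time is taken on the CPU side, so a call to cudaDeviceSynchronize is needed to make sure the GPU operations are measured correctly (including the overhead of the device synchronization), unless you are sure that Thrust synchronizes.
OK, interesting. So I may delete this answer. Could you report that useful information in your post? Out of curiosity: what is the time when you make no calls to Thrust and only a single call to cudaDeviceSynchronize in mysort?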