Trouble measuring the elapsed time of a CUDA program and CUDA kernels (answers)

【Question title】: Trouble measuring the elapsed time of a CUDA program and CUDA kernels
【Posted】: 2012-08-21 03:51:03
【Question】:

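I currently have three ways of measuring elapsed time: two use CUDA events, and the third records UNIX timestamps at the start and end. The CUDA event timers measure two different things: one times the whole outer loop, and the other sums the execution times of all the kernels.

The code is as follows: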

int64 x1, x2;

cudaEvent_t start;
cudaEvent_t end;
cudaEvent_t s1, s2;
float timeValue;


 #define timer_s cudaEventRecord(start, 0);
 #define timer_e cudaEventRecord(end, 0);   cudaEventSynchronize(end); cudaEventElapsedTime( &timeValue, start, end ); printf("time:  %f  ms \n", timeValue);


cudaEventCreate( &start );
cudaEventCreate( &end );
cudaEventCreate( &s1 );
cudaEventCreate( &s2 );

cudaEventRecord(s1, 0);   
x1 = GetTimeMs64();

for(int r = 0 ; r < 2 ; r++)
{
    timer_s
    kernel1<<<1, x>>>(gl_devdata_ptr);
    cudaThreadSynchronize();
    timer_e
    sum += timeValue;

    for(int j = 0 ; j < 5; j++)
    {
        timer_s
        kernel2<<<1,x>>>(gl_devdata_ptr);
        cudaThreadSynchronize();
        timer_e
        sum += timeValue;

        timer_s
        kernel3<<<1,x>>>(gl_devdata_ptr);
        cudaThreadSynchronize();
        timer_e
        sum += timeValue;
    }

    timer_s
    kernel4<<<y, x>>> (gl_devdata_ptr);
    cudaThreadSynchronize();
    timer_e
    sum += timeValue;
}

x2 = GetTimeMs64();

cudaEventRecord(s2, 0);   
cudaEventSynchronize(s2); 
cudaEventElapsedTime( &timeValue, s1, s2 ); 
printf("elapsed cuda :       %f  ms \n", timeValue);
printf("elapsed sum :       %f  ms \n", sum);
printf("elapsed win :       %lld  ms \n", (long long)(x2 - x1));

GetTimeMs64 is a function I found on StackOverflow:

int64 GetTimeMs64()
{
 /* Windows */
 FILETIME ft;
 LARGE_INTEGER li;
 uint64 ret;

 /* Get the amount of 100 nano seconds intervals elapsed since January 1, 1601 (UTC) and copy it
  * to a LARGE_INTEGER structure. */
 GetSystemTimeAsFileTime(&ft);
 li.LowPart = ft.dwLowDateTime;
 li.HighPart = ft.dwHighDateTime;

 ret = li.QuadPart;
 ret -= 116444736000000000LL; /* Convert from file time to UNIX epoch time. */
 ret /= 10000; /* From 100 nano seconds (10^-7) to 1 millisecond (10^-3) intervals */

 return ret;
}

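These are not the real variable names or kernel names; I removed some things to make the code smaller.

So the problem is that each measurement gives me a completely different total time.

Some sample output from a run I just did: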

elapsed cuda : 21.076832    
elapsed sum :  4.177984     
elapsed win :  27

So why is there such a big difference? The sum of all the kernel calls is about 4 ms; where are the other 18 ms? CPU time?

【Comments】:

Tags: time cuda

【Answer 1】:

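cudaThreadSynchronize is a very high-overhead operation because it has to wait for all work on the GPU to complete.

If you structure your code as follows, you should get correct results: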

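int64 x1, x2;

// One event before the loops, one before each kernel launch, one after the loops.
const int k_maxEvents = 1 + 2 * (1 + (5 * 2) + 1) + 1;  // = 26
cudaEvent_t events[k_maxEvents];
int eIdx = 0;
float timeValue;

// Create every event that will be recorded below.
for (int e = 0; e < k_maxEvents; ++e)
{
    cudaEventCreate(&events[e]);
}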

x1 = GetTimeMs64();
cudaEventRecord(events[eIdx++], 0);       
for(int r = 0 ; r < 2 ; r++)
{
    cudaEventRecord(events[eIdx++], 0);
    kernel1<<<1, x>>>(gl_devdata_ptr);

    for(int j = 0 ; j < 5; j++)
    {
        cudaEventRecord(events[eIdx++], 0);
        kernel2<<<1,x>>>(gl_devdata_ptr);

        cudaEventRecord(events[eIdx++], 0);
        kernel3<<<1,x>>>(gl_devdata_ptr);
    }

    cudaEventRecord(events[eIdx++], 0);
    kernel4<<<y, x>>> (gl_devdata_ptr);
}

cudaEventRecord(events[eIdx++], 0);
cudaDeviceSynchronize();

x2 = GetTimeMs64();

// events[eIdx - 1] is the last event that was actually recorded.
cudaEventElapsedTime( &timeValue, events[0], events[eIdx - 1] );
printf("elapsed cuda :       %f  ms \n", timeValue);
// TODO: the time between consecutive events approximates the time to execute
// the kernel launched between them.
// On WDDM a context switch may occur between any of the kernels, leading
// to higher than expected results.
// printf("elapsed sum :       %f  ms \n", sum);
printf("elapsed win :       %lld  ms \n", (long long)(x2 - x1));
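
The TODO in the code above (summing the time between consecutive events) could be sketched roughly as follows. The helper name SumEventIntervals and its callback parameter are hypothetical and not part of the original answer; the callback is used only so that the sketch compiles without the CUDA toolkit. In the real program the callback would wrap cudaEventElapsedTime for events[a] and events[b].

```cpp
#include <vector>

// Hypothetical helper, not part of the original answer.
// elapsedMs(a, b) must return the milliseconds between recorded events a and b.
// In the CUDA program it would wrap:
//     cudaEventElapsedTime(&ms, events[a], events[b]);
template <typename ElapsedFn>
float SumEventIntervals(int eventCount, ElapsedFn elapsedMs,
                        std::vector<float>* perInterval = nullptr)
{
    float sum = 0.0f;
    for (int i = 0; i + 1 < eventCount; ++i)
    {
        // Time taken by whatever was enqueued between events i and i + 1.
        const float ms = elapsedMs(i, i + 1);
        if (perInterval)
        {
            perInterval->push_back(ms);
        }
        sum += ms;
    }
    return sum;
}
```

With the variables from the answer, this might be called after cudaDeviceSynchronize() as SumEventIntervals(eIdx, [&](int a, int b) { float ms = 0.0f; cudaEventElapsedTime(&ms, events[a], events[b]); return ms; }). Because the events are recorded back to back, the intervals leave no gaps between them, so their sum should be close to the "elapsed cuda" value rather than to the much smaller per-kernel sum from the question.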

On Windows, an easier way to measure time is to use QueryPerformanceCounter and QueryPerformanceFrequency.
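
A minimal sketch of such a timer is shown below. The function name NowMs is hypothetical, and the std::chrono fallback for non-Windows builds is an assumption added only so the sketch compiles elsewhere; neither comes from the original answer.

```cpp
#include <chrono>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: returns a monotonic timestamp in milliseconds.
// Only differences between two calls are meaningful.
double NowMs()
{
#ifdef _WIN32
    LARGE_INTEGER freq;
    LARGE_INTEGER now;
    QueryPerformanceFrequency(&freq);  // counter ticks per second
    QueryPerformanceCounter(&now);     // current counter value
    return 1000.0 * static_cast<double>(now.QuadPart) /
           static_cast<double>(freq.QuadPart);
#else
    // Assumed fallback so the sketch also builds on non-Windows systems.
    using Clock = std::chrono::steady_clock;
    return std::chrono::duration<double, std::milli>(
               Clock::now().time_since_epoch()).count();
#endif
}
```

To time the whole loop, take one reading before the first kernel launch and a second one after cudaDeviceSynchronize() returns, then print the difference with "%f".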

If you write the example above without events

#include "nvToolsExt.h"
nvtxRangePushA("CPU Time");
for(int r = 0 ; r < 2 ; r++)
{
    kernel1<<<1, x>>>(gl_devdata_ptr);

    for(int j = 0 ; j < 5; j++)
    {
        kernel2<<<1,x>>>(gl_devdata_ptr); 
        kernel3<<<1,x>>>(gl_devdata_ptr);
    }
    kernel4<<<y, x>>> (gl_devdata_ptr);
}

cudaDeviceSynchronize(); 
nvtxRangePop();

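and run it in the Nsight Visual Studio Edition 1.5-2.2 CUDA Trace Activity or in Visual Profiler 4.0+, all of the timings will be available. The GPU times will be more accurate than those collected with the cudaEvents API. Using nvtxRangePush to measure the CPU time range is optional; the same figure can be obtained by measuring from the first CUDA API call in the example to the end of cudaDeviceSynchronize.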

【Comments】:

After reading your answer I tried QueryPerformanceCounter and it worked. I got the function from here: stackoverflow.com/questions/1739259/… Thanks.