Redirecting CUDA printf to a C++ stream

【Question title】: Redirecting CUDA printf to a C++ stream
【Posted】: 2014-01-20 15:50:16
【Question】:

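Problem statement

I am working on a large project that uses a logger for debugging. Since I like to keep track of what is happening in some CUDA kernels, I tried to find a way to redirect the printf of my CUDA kernels to a stringstream (or any stream) that can then be forwarded to the logger.

Possible solution

I managed to do it with the following code: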

#include <cuda.h>
#include <stdio.h>
#include <unistd.h> // dup

#include <iostream>
#include <sstream> // stringstream
#include <fstream> // ofstream

const char* output_file = "printf_redirect.log";

__global__ void printf_redirect(int* src, int* res)
{
    res[threadIdx.x] = threadIdx.x;
    printf("  %i: Hello World!\n", res[threadIdx.x]);
}

int main()
{
    using namespace std;

    const unsigned int N = 2;

    // Note: dummy arrays are not actually used, but this should prevent the
    //       compiler from discarding the printf in the kernel.

    int *d_A, *d_B, *h_A, *h_B;
    size_t size = N * sizeof (int);
    cudaMalloc (&d_A, size);
    cudaMalloc (&d_B, size);
    h_A = (int*) malloc (size);
    h_B = (int*) malloc (size);
    cudaMemcpy (d_A, h_A, size, cudaMemcpyHostToDevice);

    std::cout << "std::cout - start" << std::endl;
    printf ("stdout - start\n");

    /// REGULAR PRINT
    // Print to regular stdout
    std::cout << "Output to stdout:" << std::endl;
    printf_redirect<<<1,1>>> (d_A, d_B);
    cudaDeviceSynchronize ();

    /// REDIRECTION TO STRINGSTREAM
    std::stringstream ss;
    // Redirect std::cout to a stringstream
    std::streambuf* backup_cout = std::cout.rdbuf ();
    std::cout.rdbuf (ss.rdbuf ());
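    // Redirect stdout to /dev/null, fully buffered in a local buffer.
    // setvbuf takes the buffer size explicitly; setbuf would assume BUFSIZ bytes.
    char buf[1024] = "";
    int backup_stdout = dup (fileno (stdout));
    freopen ("/dev/null", "w", stdout);
    setvbuf (stdout, buf, _IOFBF, sizeof (buf));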

    std::cout << "Redirected output:" << std::endl;
    printf_redirect<<<1,N>>> (d_A, d_B);
    cudaDeviceSynchronize ();

    // Add CUDA buffer to a stringstream
    ss << buf;

    // Write stringstream to file
    std::ofstream outFile;
    outFile.open (output_file);
    outFile << ss.str ();
    outFile.close ();

    /// RESET REDIRECTION
    // Redirect back to initial stdout: flush what is pending, detach the
    // local buffer, then point stdout's descriptor back at the saved copy.
    fflush (stdout);
    setbuf (stdout, NULL);
    dup2 (backup_stdout, fileno (stdout));
    close (backup_stdout);
    // Redirect back to initial std::cout
    std::cout.rdbuf (backup_cout);

    std::cout << "std::cout - end" << std::endl;
    printf ("stdout - end\n");

    cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    free (h_A);
    free (h_B);
}

I used the following questions to achieve this:

Running the program, we get the following on the console:

std::cout - start
stdout - start
Output to stdout:
  0: Hello World!
std::cout - end
stdout - end

In printf_redirect.log:

Redirected output:
  0: Hello World!
  1: Hello World!

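Question

Is there an easier way to achieve this? (For example, a hidden CUDA option or a neat C/C++ trick.)

Note that the final solution will end up in a utility class to reduce verbosity in the actual code.

【Comments】:

I see that when trying to use cout from a kernel, the compiler gives an error saying it cannot be done because it is a host function. Can you explain why cout is considered a host function while printf is not?

Tags: c++ linux logging cuda io-redirection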

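【Solution 1】:

Device-side printf() causes implicit serialization of the threads that are printing, so you probably don't want to use it in production code.

Device-side printf() works by having the kernel copy messages into a preallocated ring buffer. Upon implicit or explicit device synchronization (cudaDeviceSynchronize()), CUDA dumps everything in the buffer to stdout and then clears it.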
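
Since the built-in output only reaches stdout at synchronization, one way to capture it is to redirect the file descriptor behind stdout for the duration of the launch and read the text back afterwards. The following is a minimal host-side sketch, not a tested CUDA recipe: it assumes POSIX (dup, dup2, close) and assumes the CUDA runtime writes through the process's normal stdout, as the question's own redirection suggests. StdoutCapture is a hypothetical name, not part of CUDA.

```cpp
#include <stdio.h>
#include <string>
#include <unistd.h>   // dup, dup2, close (POSIX)

// Hypothetical helper: while active, everything written to stdout lands in a
// temporary file; stop() restores stdout and returns the captured text.
class StdoutCapture {
public:
    StdoutCapture() : tmp_(tmpfile()), saved_fd_(-1) {
        if (!tmp_) return;                       // no temp file: capture nothing
        fflush(stdout);                          // earlier output still goes to the real stdout
        saved_fd_ = dup(fileno(stdout));         // remember where stdout pointed
        if (saved_fd_ >= 0)
            dup2(fileno(tmp_), fileno(stdout));  // stdout's descriptor now refers to the temp file
    }

    ~StdoutCapture() {
        stop();
        if (tmp_) fclose(tmp_);
    }

    StdoutCapture(const StdoutCapture&) = delete;
    StdoutCapture& operator=(const StdoutCapture&) = delete;

    // Restore stdout and return what was written since construction.
    // A second call returns an empty string.
    std::string stop() {
        std::string text;
        if (saved_fd_ < 0) return text;
        fflush(stdout);                          // push buffered output into the temp file
        dup2(saved_fd_, fileno(stdout));         // point stdout back at its original target
        close(saved_fd_);
        saved_fd_ = -1;
        rewind(tmp_);
        char chunk[4096];
        size_t n;
        while ((n = fread(chunk, 1, sizeof chunk, tmp_)) > 0)
            text.append(chunk, n);
        return text;
    }

private:
    FILE* tmp_;      // temporary file that receives the redirected output
    int saved_fd_;   // duplicate of the original stdout descriptor, -1 when inactive
};
```

In the question's program this would wrap the launch: construct the object, launch the kernel, call cudaDeviceSynchronize(), then pass the string returned by stop() to the logger.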

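You could simply implement your own device printf(). Its performance would probably be no worse than that of the built-in one. The only disadvantage is that you have to pass the ring buffer to the kernel and add a call that processes it after the kernel returns.

The implementation would look something like this:

Create a buffer with room for a fixed number of printf() format strings and their associated 32-bit or 64-bit arguments.
Create a device function that uses atomicInc() to keep track of the current print location, and that takes a format string and arguments and copies them into the current location.
Pass the ring buffer to the kernel, which then passes it to the device function along with the arguments to print.
Create a host function that takes the ring buffer, runs the format strings and arguments through host-side sprintf(), and passes the results to the logger.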
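
The four steps above can be modeled on the host alone. The sketch below is plain C++, not CUDA, and every name in it (PrintRecord, PrintRing, ring_printf, drain_ring, kRingCapacity) is hypothetical. It assumes one record holds a short format string and two 64-bit arguments, and it uses std::atomic with a modulo where the device function would call atomicInc(). In a real kernel ring_printf would be a __device__ function writing into device memory, and the host would copy the buffer back with cudaMemcpy before draining it.

```cpp
#include <atomic>
#include <cstdio>
#include <cstring>
#include <functional>
#include <string>

const unsigned kRingCapacity = 8;   // hypothetical: number of records kept

// Hypothetical record: one format string plus two 64-bit arguments.
struct PrintRecord {
    char fmt[64];
    long long args[2];
};

// Hypothetical ring buffer. In a CUDA build this would live in device memory.
struct PrintRing {
    PrintRecord records[kRingCapacity];
    std::atomic<unsigned> next{0};   // stands in for the counter bumped by atomicInc()
};

// Stand-in for the device function: claim a slot, then copy the format
// string and arguments into it. Formats must use %lld for their arguments.
void ring_printf(PrintRing& ring, const char* fmt, long long a0 = 0, long long a1 = 0) {
    unsigned slot = ring.next.fetch_add(1) % kRingCapacity;
    PrintRecord& rec = ring.records[slot];
    std::strncpy(rec.fmt, fmt, sizeof rec.fmt - 1);
    rec.fmt[sizeof rec.fmt - 1] = '\0';
    rec.args[0] = a0;
    rec.args[1] = a1;
}

// Host function: format every surviving record and hand it to the logger.
void drain_ring(PrintRing& ring, const std::function<void(const std::string&)>& log) {
    unsigned written = ring.next.load();
    unsigned count = written < kRingCapacity ? written : kRingCapacity;
    for (unsigned i = 0; i < count; ++i) {
        // Oldest surviving record first; older ones were overwritten on wrap.
        const PrintRecord& rec = ring.records[(written - count + i) % kRingCapacity];
        char line[256];
        // Arguments the format does not consume are ignored by snprintf.
        std::snprintf(line, sizeof line, rec.fmt, rec.args[0], rec.args[1]);
        log(line);
    }
    ring.next.store(0);   // clear the buffer for the next kernel launch
}
```

The logger is passed as a callback so that the drain step does not depend on any particular logging library; appending each line to a stringstream inside the callback would match the setup in the question.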

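【Discussion】:

There is an older cuprintf that NVIDIA released to developers as source code before Fermi. I believe it could be adapted for this purpose.
The old cuPrintf is still "available" in the CUDA simplePrintf sample code.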