调用 finish() 时 OpenCL 崩溃答案

【问题标题】：OpenCL crash when calling finish()调用 finish() 时 OpenCL 崩溃
【发布时间】：2020-01-23 04:02:43
【问题描述】：

我正在使用 c++ 在 mac 上编写一个 OpenCL 应用程序，它在某些情况下会崩溃，具体取决于工作量。程序因 SIGABRT 而崩溃。

有什么方法可以获取有关错误的更多信息？为什么要提出 SIGABRT？我能抓住它吗？

编辑：我知道这个程序很牛逼，但是我会试着解释一下，以防有人想尝试一下。

通过调试我发现SIGABRT的原因是内核超时之一。

该程序是基于图块的 3D 渲染器。这是该算法的 OpenCL 实现：https://github.com/ssloy/tinyrenderer

屏幕分为 8x8 块。其中一个内核（切片器）计算哪些多边形与每个切片重叠，并将结果存储在名为tilePolys 的数据结构中。随后的内核（光栅化器）在每个图块上运行一个工作项，遍历占据图块的多边形列表并将它们光栅化。

平铺程序写入一个整数缓冲区，该缓冲区是多边形索引列表的列表。每个列表的大小都是固定的（polysPerTile + 1 表示计数），其中第一个元素是计数，随后的 polysPerTile 元素是图块中多边形的索引。每个图块都有一个这样的列表。

由于某些原因，在某些情况下，拼贴器会将非常大的多边形计数 (13172746) 写入tilePolys 中的拼贴列表之一。这会导致光栅化器循环很长时间并超时。

奇怪的是，写入大计数的索引永远不会被 tiler 访问。

tiler内核的代码如下：

// this kernel is executed once per polygon
// it computes which tiles are occupied by the polygon and adds the index of the polygon to the list for that tile
kernel void tiler(
        // number of polygons
        ulong nTris,
        // width of screen
        int width,
        // height of screen
        int height,
        // number of tiles in x direction
        int tilesX,
        // number of tiles in y direction
        int tilesY,
        // number of pixels per tile (tiles are square)
        int tileSize,
        // size of the polygon list for each tile
        int polysPerTile,
        // 4x4 matrix representing the viewport
        global const float4* viewport, 
        // vertex positions
        global const float* vertices,
        // indices of vertices
        global const int* indices,
        // array of array-lists of polygons per tile
        // structure of list is an int representing the number of polygons covering that tile, 
        // followed by [polysPerTile] integers representing the indices of the polygons in that tile
        // there are [tilesX*tilesY] such arraylists
        volatile global int* tilePolys)
{
    size_t faceInd = get_global_id(0);

    // compute vertex position in viewport space
    float3 vs[3];
    for(int i = 0; i < 3; i++) {
        // indices are vertex/uv/normal
        int vertInd = indices[faceInd*9+i*3];

        float4 vertHomo = (float4)(vertices[vertInd*4], vertices[vertInd*4+1], vertices[vertInd*4+2], vertices[vertInd*4+3]);

        vertHomo = vec4_mul_mat4(vertHomo, viewport);
        vs[i] = vertHomo.xyz / vertHomo.w;
    }

    float2 bboxmin = (float2)(INFINITY,INFINITY);
    float2 bboxmax = (float2)(-INFINITY,-INFINITY);

    // size of screen
    float2 clampCoords = (float2)(width-1, height-1);

    // compute bounding box of triangle in screen space
    for (int i=0; i<3; i++) {
        for (int j=0; j<2; j++) {
            bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
            bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
        }
    }

    // transform bounding box to tile space
    int2 tilebboxmin = (int2)(bboxmin[0] / tileSize, bboxmin[1] / tileSize);
    int2 tilebboxmax = (int2)(bboxmax[0] / tileSize, bboxmax[1] / tileSize);

    // loop over all tiles in bounding box
    for(int x = tilebboxmin[0]; x <= tilebboxmax[0]; x++) {
        for(int y = tilebboxmin[1]; y <= tilebboxmax[1]; y++) {

            // get index of tile
            int tileInd = y * tilesX + x;
            // get start index of polygon list for this tile
            int counterInd = tileInd * (polysPerTile + 1);
            // get current number of polygons in list
            int numPolys = atomic_inc(&tilePolys[counterInd]);
            // if list is full, skip tile
            if(numPolys >= polysPerTile) {
                // decrement the count because we will not add to the list
                atomic_dec(&tilePolys[counterInd]);
            } else {
                // otherwise add the poly to the list
                // the index is the offset + numPolys + 1 as tilePolys[counterInd] holds the poly count
                int ind = counterInd + numPolys + 1;
                tilePolys[ind] = (int)(faceInd);
            }   
        }
    }
}

我的理论是：

我错误地实现了用于读取和递增计数的原子函数
我使用了不正确的数字格式，导致垃圾被写入tilePolys
我的其他内核之一无意中写入了tilePolys 缓冲区

我不认为这是最后一个，因为如果我不将faceInd 写入tilePolys，而是写入一个常量值，那么大多边形数就会消失。

tilePolys[counterInd+numPolys+1] = (int)(faceInd); // this is the problem line
tilePolys[counterInd+numPolys+1] = (int)(5);       // this fixes the issue

【问题讨论】：

我对 OpenCL C++ 接口并不太熟悉，但在 C 语言中，大多数函数都会返回错误代码或设置错误参数。你在检查这些吗？

标签： macos opencl sigabrt

【解决方案1】：

看起来您的内核在 GPU 本身上崩溃了。你不能直接得到任何额外的诊断，至少在 macOS 上是这样。你需要开始缩小问题的范围。一些建议：

由于当前在clFinish() 中发生崩溃，您不知道是什么异步命令导致了崩溃。尝试将所有排队调用切换到阻塞模式。这应该会导致它在实际出错的调用中崩溃。
检查所有 OpenCL API 调用的返回/错误代码。有时，忽略早期调用中的错误可能会导致依赖早期结果的后续调用出现问题。例如，如果创建缓冲区失败，则将该缓冲区创建的结果作为内核参数传递将导致尝试运行内核时出现问题。
崩溃的最可能原因是您的 OpenCL 内核访问内存越界或滥用指针。重新检查所有数组索引计算。
检查问题是否出现在较小的工作批次中。从一个工作组（或工作项，如果不使用组）扩大规模，看看它是否只出现在某个工作规模之外。这可能会为您提供有关可能导致崩溃的缓冲区大小和数组索引的线索。
系统地注释掉内核的某些部分。如果您在注释掉一段特定代码后崩溃消失，那么问题很可能出在该代码中。
如果您已将问题缩小到一小段代码，但无法确定问题的来源，请开始记录诊断输出以检查变量是否具有您期望的值。

没有看到任何代码，我无法给你任何比这更具体的建议。

请注意，OpenCL 在 macOS 上已弃用，因此如果您专门针对该平台并且不需要支持 Linux、Windows 等。我建议改为学习 Metal Compute。 Apple 已经明确表示这是他们想要支持的 GPU 编程平台，并且它的工具已经比他们的 OpenCL 工具要好得多。

我怀疑 Apple 在发布具有新型 GPU 的 Mac 时最终会停止实施 OpenCL 支持，因此即使您的目标是 Mac 以及其他平台，您也可能需要在 Mac 上切换到 Metal无论如何，在某个地方。从 macOS 10.14 开始，操作系统的最低系统要求已经包括支持 Metal 的 GPU，因此如果您希望支持所有能够运行 10.13 或更旧操作系统版本的 Mac 机型，则只需使用 OpenCL 作为备用。

【讨论】：

这是一个很好的答案，所以我只想将其添加为评论。我会注释掉内核中的所有读取和写入，以查看崩溃是否消失。然后向后工作。有时我会从设备端printfs 调试内核中获得一些运气。
感谢您的全面回复。我已经编辑了问题以包含一些代码，以防有人想看看。