Why do OpenCV calls to UMat operations sometimes take half a second to complete?

【Question title】: Why do OpenCV calls to UMat operations sometimes take half of a second to complete?
【Posted】: 2021-01-13 21:09:48
【Question】:

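My first attempt at using the GPU was disappointing because of strange timing results. Below is a snippet of code that uses UMat in OpenCV to find the dead zones at the top and bottom of a binary image. Most of the time the findNonZero call executes in under 1 ms, but occasionally it takes more than 500 ms! The delay does not seem to be related to the size of the result. Can anyone offer an explanation and a fix?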

    UMat bin; 
    // bin is loaded with a binary image of about 60 x 60;

    int top = bin.rows;
    int bottom = 0;
    for (int i = 0; i < bin.cols; i++)
    {
        UMat r = bin.col(i);
        vector<Point> pxls;
        findNonZero( r, pxls);
        cout << pxls << endl;
        if (!pxls.empty())
        {
            if (pxls.front().y < top) top = pxls.front().y;
            if (pxls.back().y > bottom) bottom = pxls.back().y;
        }
    }

Here is the report on my OpenCL support:

1 GPU devices are detected.
name:              AMD KAVERI (DRM 2.50.0, 5.8.0-36-generic, LLVM 11.0.0)
available:         1
imageSupport:      0
OpenCL_C_Version:  OpenCL C 1.1 

Number of platforms                               1
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 20.2.6
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             MESA

  Platform Name                                   Clover
Number of devices                                 1
  Device Name                                     AMD KAVERI (DRM 2.50.0, 5.8.0-36-generic, LLVM 11.0.0)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 Mesa 20.2.6
  Driver Version                                  20.2.6
  Device OpenCL C Version                         OpenCL C 1.1 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Max compute units                               8
  Max clock frequency                             720MHz
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple              64
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 0        (n/a)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              2142642176 (1.995GiB)
  Error Correction support                        No
  Max memory allocation                           1499849523 (1.397GiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   No
  Local memory type                               Local
  Local memory size                               32768 (32KiB)
  Max number of constant args                     16
  Max constant buffer size                        1499849472 (1.397GiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Profiling timer resolution                      0ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64

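【Comments】:

I have also found that results from the OpenCL version of warpPolar(...) are similarly slow and actually return corrupted data. I would add some figures to illustrate, but I don't see an option to upload images here.

Tags: c++ opencv time opencl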

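【Answer 1】:

OpenCL code is only efficient if the API is used in an efficient way. The OpenCV code base uses OpenCL very inefficiently, and its "transparent API" (cv::UMat) also promotes inefficient usage patterns.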

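Some examples of the inefficiencies:

Lazy compilation of kernel code (the first call to an OpenCV function may take many milliseconds).
Unnecessary allocations and memory transfers.
Unnecessary CPU-GPU synchronization.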

More information here.

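It is hard to know exactly what makes your code inefficient, because you do not show how the cv::UMat is initialized. But here are some general tips.

Explicitly make sure that GPU memory is allocated:

    auto mat = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);

Call the OpenCV API function once outside the main loop, to make sure it has compiled the necessary OpenCL kernels.
Do the actual work, but use as few OpenCV API calls as possible. For example, executing findNonZero(r, pxls) separately on every column is a bad idea. Even if findNonZero has an OpenCL backend (you would need to check the implementation to be sure), OpenCV will probably synchronize the GPU on every call, which is very bad for performance. It is better to call it once on the whole buffer and then process the results column by column.
Understand that this is all hopeless, and learn to use OpenCL and proper GPU profiling tools to really understand what is going on. Good luck.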
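
The "call it once on the whole buffer" tip can be sketched as follows. This is only an illustration, not code from the answer: `Pt`, `RowExtent` and `rowExtent` are hypothetical names, and `Pt` is a stand-in for cv::Point so that the sketch compiles without OpenCV. The function is a template, so it should accept any point type with integer `x` and `y` members.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for cv::Point, so the sketch builds without OpenCV.
struct Pt {
    int x;
    int y;
};

// Hypothetical result type: the row range that contains non-zero pixels.
struct RowExtent {
    int top;     // smallest y of any non-zero pixel, or `rows` if there are none
    int bottom;  // largest y of any non-zero pixel, or 0 if there are none
};

// Reduce the output of ONE whole-image findNonZero call to the top/bottom
// extent, instead of issuing one findNonZero call per column.
// The initial values mirror the question's code (top = rows, bottom = 0).
template <typename PointT>
RowExtent rowExtent(const std::vector<PointT>& nonZero, int rows) {
    assert(rows >= 0);
    RowExtent extent{rows, 0};
    for (const PointT& p : nonZero) {
        extent.top = std::min(extent.top, p.y);
        extent.bottom = std::max(extent.bottom, p.y);
    }
    return extent;
}
```

With OpenCV, the intended use would presumably be a single `findNonZero(bin, pxls)` on the whole image followed by `rowExtent(pxls, bin.rows)`, which replaces roughly 60 potential GPU round trips with one. Whether that removes the 500 ms stalls on this driver is not something the sketch can show.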

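【Comments】:

Thanks for the observations! This is an exercise in understanding what affects the timing, so the column-by-column example is not meant to be efficient. 90% of the findNonZero calls (just that call, not the whole program) return in about 0.13 ms and the rest in about 504 ms, which is odd and which I would like to understand. If it were consistently slow I would give up on the feature, but the inconsistency makes one wonder whether some tweak could improve it. And yes, I do see an OpenCL-specific backend for findNonZero in the OpenCV source.
To fully understand what is going on, you have to profile the code with a tool that can instrument OpenCL API calls and kernel executions, for example CodeXL in your case.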
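
Short of a real profiler, a host-side timing harness can at least separate the first (warm-up) call from the later ones and flag the slow outliers described above. The following is a minimal sketch using only the C++ standard library; `CallTimings`, `timeCalls` and `outliers` are hypothetical helper names, not part of OpenCV or OpenCL.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical container for per-call wall-clock timings.
struct CallTimings {
    double warmupMs;               // first call; may include lazy kernel compilation
    std::vector<double> steadyMs;  // every later call, in call order
};

// Run `fn` once as a warm-up, then `iterations` more times, timing each call.
template <typename Fn>
CallTimings timeCalls(Fn&& fn, std::size_t iterations) {
    using Clock = std::chrono::steady_clock;
    auto elapsedMs = [&fn]() {
        const auto start = Clock::now();
        fn();
        return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
    };
    CallTimings timings;
    timings.warmupMs = elapsedMs();
    timings.steadyMs.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        timings.steadyMs.push_back(elapsedMs());
    }
    return timings;
}

// Indices of calls that took more than `factor` times the median duration.
std::vector<std::size_t> outliers(const std::vector<double>& ms, double factor) {
    assert(factor > 0.0);
    std::vector<std::size_t> slow;
    if (ms.empty()) return slow;
    std::vector<double> sorted(ms);
    std::sort(sorted.begin(), sorted.end());
    const double median = sorted[sorted.size() / 2];
    for (std::size_t i = 0; i < ms.size(); ++i) {
        if (ms[i] > factor * median) slow.push_back(i);
    }
    return slow;
}
```

Wrapping the findNonZero call in a lambda passed to `timeCalls` would show whether the slow calls recur at regular intervals or only at the start. Wall-clock timing cannot say whether a slow call was spent compiling, transferring, or waiting on the driver, so a tool that instruments the OpenCL calls themselves is still needed to explain the cause.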