OpenCL speed and OUT_OF_RESOURCES

【Question title】: OpenCL speed and OUT_OF_RESOURCES
【Posted】: 2018-02-12 17:05:04
【Question】:

I am very new to OpenCL and am trying my first program. I implemented a simple sinc filter for waveforms. The code works, but I have two questions:

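1. As soon as I increase the size of the input matrix (numrows needs to go up to 100 000), I get (clEnqueueReadBuffer failed: OUT_OF_RESOURCES), even though the matrix is relatively small (a few MB). I think this is related in some way to the work-group size, but could someone explain how I can fix it? Could it be a driver issue?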

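Update:

Leaving the local size as None crashes.
Adjusting the local size to (1,600) for the GPU and (1,50) for the Intel HD lets me get to about 6400 rows. For larger sizes, however, it crashes on the GPU, while the Intel HD just freezes and does nothing (0% in the resource monitor).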

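2. I have an Intel HD4600 and an Nvidia K1100M GPU available, but the Intel is about 2 times faster. I understand that this is partly because, unlike with the discrete GPU, the arrays do not have to be copied out of the memory the integrated Intel device shares with the host. Still, I expected only a marginal difference. Is this normal, or should my code be better optimized for the GPU? (resolved)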

Thanks for your help!!

    from __future__ import absolute_import, print_function
    import numpy as np
    import pyopencl as cl
    import os
    os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'       
    import matplotlib.pyplot as plt

    def resample_opencl(y,key='GPU'):
            #
            # selecting to run on GPU or CPU
            #
            newlen = 1200  
            my_platform = cl.get_platforms()[0]
            device =my_platform.get_devices()[0] 

            for found_platform in cl.get_platforms():    
                if (key == 'GPU') and (found_platform.name == 'NVIDIA CUDA'):         
                    my_platform = found_platform
                    device =my_platform.get_devices()[0]
                    print("using GPU")


            #
            #Create context for GPU/CPU
            #
            ctx = cl.Context([device])

            #
            # Create queue for each kernel execution
            #
            queue = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
    #        queue = cl.CommandQueue(ctx)

            prg = cl.Program(ctx, """
            __kernel void resample(      
                int M,
                __global const float *y_g,
                __global float *res_g)

            {
                int row = get_global_id(0);
                int col = get_global_id(1);
                int gs = get_global_size(1);
                __private float tmp,tmp2,x;
                __private float t;
                t = (float)(col)/2+1;


                tmp=0;
                tmp2=0;


                for (int i=0; i<M ; i++)
                    {
                    x = (float)(i+1);
                    tmp2 = (t- x)*3.14159;
                    if (t == x) {
                        tmp += y_g[row*M + i]  ;
                                    }
                    else 
                        tmp += y_g[row*M +i]  * sin(tmp2)/tmp2;
                     }

                res_g[row*gs +  col] = tmp;


            }
            """).build()

            mf = cl.mem_flags

            y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
            res = np.zeros((np.shape(y)[0],newlen)).astype(np.float32)
            res_g = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

            M = np.array(600).astype(np.int32)
            prg.resample(queue, res.shape, (1,200),M, y_g, res_g)


            event = cl.enqueue_copy(queue, res, res_g)
            print("success")
            event.wait()
            return res,event





    if __name__ == "__main__":
        #
        # this is the number i need to increase ( up to some 100 000)
        numrows = 2000  
        Gaussian = lambda t : 10 * np.exp(-(t - 50)**2 / (2. * 2**2))


        x = np.linspace(1, 101, 600, endpoint=False).astype(np.float32)
        t = np.linspace(1, 101, 1200, endpoint=False).astype(np.float32)
        y= np.zeros(( numrows,np.size(x)))
        y[:] = Gaussian(x).astype(np.float32)
        y = y.astype(np.float32)

        res,event =  resample_opencl(y,'GPU')
        print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)

        #
        # test plot if it worked
        #
        plt.figure()
        plt.plot(x,y[1,:],'+')
        plt.plot(t,res[1,:])

【Comments】:

Tags: python-3.x opencl pyopencl

【Answer 1】:

Regarding 1.

Your newlen has to be divisible by 200, because that is the value you set in the local dimensions (1,200). I increased it to 9600 and it still worked fine.
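The divisibility requirement comes from OpenCL 1.x, where the global size in each dimension has to be a multiple of the local size in that dimension. A minimal sketch of rounding a global size up to the next multiple follows. `round_up_global_size` is a hypothetical helper, not part of pyopencl. A padded launch would also need a bounds check in the kernel, and because the kernel in the question uses `get_global_size(1)` as the row stride of `res_g`, padding the column dimension would change the output layout unless the real width is passed in as an argument.

```python
def round_up_global_size(shape, local_size):
    """Round each global dimension up to the next multiple of the local size.

    Hypothetical helper for illustration; not a pyopencl function.
    """
    if len(shape) != len(local_size):
        raise ValueError("shape and local_size must have the same number of dimensions")
    return tuple(((n + l - 1) // l) * l for n, l in zip(shape, local_size))


# (2000, 1200) already fits a local size of (1, 200); (2000, 1250) gets padded.
print(round_up_global_size((2000, 1200), (1, 200)))
print(round_up_global_size((2000, 1250), (1, 200)))
```

With a local size of (1, 200), any newlen that is already a multiple of 200 is returned unchanged.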

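Update

After your update, I would suggest not specifying the local dimensions and letting the implementation decide instead:

    prg.resample(queue, res.shape, None, M, y_g, res_g)

Performance may also improve if newlen and numrows are multiples of 16.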

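There is no rule that an Nvidia GPU must outperform an Intel GPU, especially since, according to Wikipedia, the difference in GFLOPS between the two is not large (549.89 vs. 288–432). A GFLOPS comparison like this should be taken with a grain of salt, because a given algorithm may suit one GPU better than the other. In other words, the numbers suggest which GPU is usually faster, but the outcome can differ from algorithm to algorithm.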

The kernel for 100000 rows requires:

    y_g:   100000 * 600 * 4  = 240000000 bytes ≈ 229 MB
    res_g: 100000 * 1200 * 4 = 480000000 bytes ≈ 457.8 MB

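The Quadro K1100M has 2 GB of global memory, which should be enough to handle 100000 rows. As far as I could find, the Intel HD 4600 is limited only by system memory, so I suspect that should not be a problem either.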
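The arithmetic above can be reproduced with a few lines of Python, which makes it easy to try other row counts. This is only a sketch: `buffer_sizes` is a hypothetical helper, and the defaults (600 input samples, 1200 output samples, 4-byte floats) are taken from the code in the question.

```python
def buffer_sizes(numrows, in_len=600, out_len=1200, itemsize=4):
    """Return the byte sizes of the input (y_g) and output (res_g) buffers.

    Hypothetical helper; defaults mirror the question's float32 arrays.
    """
    y_bytes = numrows * in_len * itemsize
    res_bytes = numrows * out_len * itemsize
    return y_bytes, res_bytes


MIB = 1024 ** 2
y_bytes, res_bytes = buffer_sizes(100000)
print(f"y_g:   {y_bytes} bytes = {y_bytes / MIB:.1f} MiB")
print(f"res_g: {res_bytes} bytes = {res_bytes / MIB:.1f} MiB")
print(f"total: {(y_bytes + res_bytes) / MIB:.1f} MiB")
```

It may also be worth comparing these numbers with `device.global_mem_size` and `device.max_mem_alloc_size` in pyopencl, since the largest single buffer a device accepts can be considerably smaller than its total global memory.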

Regarding 2.

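The time is not being measured correctly. What you measure is not the kernel execution time but the time it takes to copy the data back to the host. So it is no surprise that this number is lower for the Intel device, which shares system memory with the host. To measure the kernel execution time, do the following: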

    event = prg.resample(queue, res.shape, (1, 200), M, y_g, res_g)
    event.wait()
    print("OpenCl GPU profiler", (event.profile.end - event.profile.start) * 1e-9)

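I don't know how to measure the whole thing, including copying the data back to the host, with OpenCL profiling events in pyopencl, but timing it in plain Python gives similar results: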

    import time

    start = time.time()
    ...  # code to be measured
    end = time.time()
    print(end - start)

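【Comments】:

Re 1. Increasing it to 9600 unfortunately gives me the same error... but I have updated the question, since what I am interested in is increasing numrows. Re 2. Thanks for the clarification. However, even when timing with Python, my K1100 GPU still seems to be slower than the Intel HD.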

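【Answer 2】:

I think I solved the problem:

Intel HD: turning off profiling fixes everything. The code runs without any problems.
K1100M GPU: it still crashes, but I suspect this is a timeout issue, because the same graphics card is driving my display.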
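If the crash really is a display timeout, a common workaround is to split the work into several shorter kernel launches. The sketch below only shows the batching of row ranges. `row_batches` is a hypothetical helper, the batch size of 4000 is an arbitrary example value, and whether this avoids the crash on the K1100M is an assumption that has not been verified here.

```python
def row_batches(numrows, batch_rows):
    """Yield (start, stop) row ranges that cover numrows in chunks of batch_rows.

    Hypothetical helper for illustration; the last range may be shorter.
    """
    if batch_rows <= 0:
        raise ValueError("batch_rows must be positive")
    for start in range(0, numrows, batch_rows):
        yield start, min(start + batch_rows, numrows)


# Example: 10000 rows processed in batches of at most 4000 rows.
for start, stop in row_batches(10000, 4000):
    print(start, stop)
```

Each slice `y[start:stop]` could then be passed to `resample_opencl` from the question and the partial results joined with `np.concatenate`. Since that function creates a context and builds the program on every call, those steps would be better moved outside the loop in practice.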

【Comments】: