[Title]: Speedup sum of intensities calculation by using JOCL/OPENCL
[Posted]: 2012-11-27 00:34:03
[Description]:

Hello, I am new to JOCL (OpenCL). I wrote this code to get the sum of intensities for each image. The kernel takes a one-dimensional array containing the pixels of all images, stored back to back. The images are 300x300, so each image has 90000 pixels. At the moment it is slower than when I do it sequentially.

My code:

package PAR;

/*
 * JOCL - Java bindings for OpenCL
 * 
 * Copyright 2009 Marco Hutter - http://www.jocl.org/
 */
import IMAGE_IO.ImageReader;
import IMAGE_IO.Input_Folder;
import static org.jocl.CL.*;

import org.jocl.*;

/**
 * A small JOCL sample.
 */
public class IPPARA {

    /**
     * The source code of the OpenCL program to execute
     */
    private static String programSource =
            "__kernel void "
            + "sampleKernel(__global uint *a,"
            + "             __global uint *c)"
            + "{"
            + "__private uint intensity_core=0;"
            + "      uint i = get_global_id(0);"
            + "      for(uint j=i*90000; j < (i+1)*90000; j++){ "
            + "              intensity_core += a[j];"
            + "     }"
            + "c[i]=intensity_core;" 
            + "}";

    /**
     * The entry point of this sample
     *
     * @param args Not used
     */
    public static void main(String args[]) {
        long numBytes[] = new long[1];

        ImageReader imagereader = new ImageReader() ;
        int srcArrayA[]  = imagereader.readImages();

        int size[] = new int[1];
        size[0] = srcArrayA.length;
        long before = System.nanoTime();
        int dstArray[] = new int[size[0]/90000];


        Pointer srcA = Pointer.to(srcArrayA);
        Pointer dst = Pointer.to(dstArray);


        // Obtain the platform IDs and initialize the context properties
        System.out.println("Obtaining platform...");
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

        // Create an OpenCL context on a GPU device
        cl_context context = clCreateContextFromType(
                contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);
        if (context == null) {
            // If no context for a GPU device could be created,
            // try to create one for a CPU device.
            context = clCreateContextFromType(
                    contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);

            if (context == null) {
                System.out.println("Unable to create a context");
                return;
            }
        }

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Get the list of GPU devices associated with the context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);

        // Obtain the cl_device_id for the first device
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
                Pointer.to(devices), null);

        // Create a command-queue
        cl_command_queue commandQueue =
                clCreateCommandQueue(context, devices[0], 0, null);

        // Allocate the memory objects for the input- and output data
        cl_mem memObjects[] = new cl_mem[2];
        memObjects[0] = clCreateBuffer(context,
                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_uint * srcArrayA.length, srcA, null);
        memObjects[1] = clCreateBuffer(context,
                CL_MEM_READ_WRITE,
                Sizeof.cl_uint * (srcArrayA.length/90000), null, null);

        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(context,
                1, new String[]{programSource}, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel
        cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0,
                Sizeof.cl_mem, Pointer.to(memObjects[0]));
        clSetKernelArg(kernel, 1,
                Sizeof.cl_mem, Pointer.to(memObjects[1]));

        // Set the work-item dimensions
        long local_work_size[] = new long[]{1};
        long global_work_size[] = new long[]{(srcArrayA.length/90000)*local_work_size[0]};


        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                global_work_size, local_work_size, 0, null, null);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
                (srcArrayA.length/90000) * Sizeof.cl_uint, dst, 0, null, null);

        // Release kernel, program, and memory objects
        clReleaseMemObject(memObjects[0]);
        clReleaseMemObject(memObjects[1]);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);


        long after = System.nanoTime();

        System.out.println("Time: " + (after - before) / 1e9);

    }
}

Following the suggestions in the answer, the parallel code on the CPU is now almost as fast as the sequential code. Is there anything else that can be improved?
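For comparison, what the kernel computes can be sketched in plain Java. This is a hypothetical stand-in (it does not use the poster's ImageReader), but a parallel IntStream over images is a reasonable CPU baseline to measure the OpenCL version against:

```java
import java.util.stream.IntStream;

public class IntensitySumBaseline {
    static final int PIXELS_PER_IMAGE = 90000; // 300x300

    // Sequential per-image intensity sums over the same flat pixel
    // layout the kernel uses (all images back to back in one array).
    static int[] sumsSequential(int[] pixels) {
        int numImages = pixels.length / PIXELS_PER_IMAGE;
        int[] sums = new int[numImages];
        for (int img = 0; img < numImages; img++) {
            int s = 0;
            for (int j = img * PIXELS_PER_IMAGE; j < (img + 1) * PIXELS_PER_IMAGE; j++) {
                s += pixels[j];
            }
            sums[img] = s;
        }
        return sums;
    }

    // Same computation, one parallel task per image (like one work item per image).
    static int[] sumsParallel(int[] pixels) {
        int numImages = pixels.length / PIXELS_PER_IMAGE;
        return IntStream.range(0, numImages).parallel().map(img -> {
            int s = 0;
            for (int j = img * PIXELS_PER_IMAGE; j < (img + 1) * PIXELS_PER_IMAGE; j++) {
                s += pixels[j];
            }
            return s;
        }).toArray();
    }
}
```

Timing these two against the JOCL version (excluding the one-time context/program setup) shows how much of the gap is PCIe transfer versus compute.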

[Discussion]:

    Tags: java opencl jocl


    [Solution 1]:
     for(uint j=i*90000; j < (i+1)*90000; j++){
             c[i] += a[j];
     }

    1) You are doing the sum in global memory (c[]), which is slow. Use a private variable to make it faster. Like this:

              "__kernel void "
            + "sampleKernel(__global uint *a,"
            + "             __global uint *c)"
            + "{"
            + "__private uint intensity_core=0;" <---this is a private variable of each core
            + "      uint i = get_global_id(0);"
            + "      for(uint j=i*90000; j < (i+1)*90000; j++){ "
            + "              intensity_core += a[j];" <---register is at least 100x faster than global memory
             //but we cannot get rid of a[] so the calculation time cannot be less than 50%
            + "     }"
            + "c[i]=intensity_core;"   
            + "}";  //expecting 100% speedup
    

    Now you have c[number of images], an array of intensity sums.

    Your local-work-size is 1; if you have at least 160 images (that is your GPU's core count), the computation will use all of the cores.

    You need 90000*num_images reads and num_images writes, plus 90000*num_images register reads/writes. Using registers will cut your kernel time roughly in half.

    2) You are doing only 1 math operation per 2 memory accesses. You need at least 10 math operations per memory access to use even a fraction of your GPU's peak GFLOPS (the 6490M peaks at 250 GFLOPS).

    Your i7 CPU can easily reach 100 GFLOPS, but your memory will be the bottleneck. It gets even worse when you send all the data over PCI Express. (The HD Graphics 3000 is rated at 125 GFLOPS.)

     // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];
     // one of the devices[] elements must be your HD3000. Example: devices[0] -> GPU,
     // devices[1] -> CPU, devices[2] -> HD3000
    

    In your program:

     // Obtain the cl_device_id for the first device
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
                Pointer.to(devices), null);
    

    the first device is used, which may or may not be the GPU you want.

    [Discussion]:

    • Could you give some insight into how to do this with my current code?
    • What are your CPU and GPU? Your local work size is 1, which is far too low.
    • Graphics: AMD Radeon HD 6490M 256 MB || Processor: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz. Thanks, will try :)
    • You have 160 processing elements, right? Just divide 90000 by 160; each core gets about 560 pixels. That is (probably) the best ratio.
    • "Have you tried the default, or a local work size bigger than 1?" I don't think so. From the sample code: long local_work_size[] = new long[]{1};
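    The split worked out in the last comment can be checked with a small sketch. The 160-processing-element count is the commenter's assumption about this GPU; the strided indexing is the same scheme Solution 2's kernel uses, so every pixel is visited exactly once:

    ```java
    public class WorkSplit {
        // Each of `workItems` items reads pixels starting at its own id and
        // striding by the number of work items; count how often each pixel is read.
        static int[] coverageCounts(int pixels, int workItems) {
            int[] counts = new int[pixels];
            for (int id = 0; id < workItems; id++) {
                for (int j = id; j < pixels; j += workItems) {
                    counts[j]++;
                }
            }
            return counts;
        }

        // Pixels handled by the busiest work item: ceil(pixels / workItems).
        // For 90000 pixels and 160 items this is 563, the "about 560" above.
        static int pixelsPerItem(int pixels, int workItems) {
            return (pixels + workItems - 1) / workItems;
        }
    }
    ```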
    [Solution 2]:

    You should use a whole work group per 300x300 image. This helps saturate the GPU cores and lets you use local memory. The kernel should also be able to process as many images concurrently as there are compute units on the device.

    The kernel below does your reduction in three steps.

    1. Read the values into one private variable per work item
    2. Write the private variable to local memory (a very simple but important step)
    3. Reduce the values in local memory to get the final value. Two ways of doing this are shown here.

    WG_MAX_SIZE is defined because I don't like passing in variable-sized local memory blocks. The value is 64 because that is a good value to use on most platforms. If you want to try larger work groups, make sure to set this value higher. Work groups smaller than WG_MAX_SIZE will still work fine.

    #define WORK_SIZE 90000
    #define WG_MAX_SIZE 64
    __kernel void sampleKernel(__global uint *a, __global uint *c)
    {
    
        local uint intensity_core[WG_MAX_SIZE];
        private uint workItemIntensity = 0;
    
        int gid = get_group_id(0);
        int lid = get_local_id(0);
        int wgsize = get_local_size(0);
        int i;
    
        for(i = gid*WORK_SIZE + lid; i < (gid+1)*WORK_SIZE; i += wgsize){ 
            workItemIntensity += a[i];
        }
        intensity_core[lid] = workItemIntensity;
        barrier(CLK_LOCAL_MEM_FENCE); //barrier, not mem_fence: all work items must see the writes
    
        //option #1
        //loop to reduce the final values O(n) time
        if(lid == 0){
            for(i=1;i<wgsize;i++){
                workItemIntensity += intensity_core[i];
            }
            c[gid]=workItemIntensity;
        }
    
        //option #2
        //O(logn) time reduction
        //assumes work group size is a power of 2
        int steps = 32 - clz(wgsize);
        for(i=1;i<steps;i++){
            if(lid % (1 << i) == 0){
            intensity_core[lid] += intensity_core[lid + (1 << (i-1))];
            }
        barrier(CLK_LOCAL_MEM_FENCE);
        }
        if(lid == 0){
            c[gid]=intensity_core[0];
        }
    }
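    Before running on a device, the reduction's indexing can be sanity-checked with a single-threaded Java simulation of one work group. This mirrors the strided read (step 1+2) and the O(log n) tree reduction (option #2), and assumes, like the kernel, that the work-group size is a power of two:

    ```java
    public class ReductionSim {
        // Simulates one work group of `wgsize` items reducing the values in `a`:
        // step 1: each "work item" accumulates a strided private sum;
        // step 2: private sums are written into the "local memory" array;
        // step 3: pairwise tree reduction, as in option #2 of the kernel.
        static int reduce(int[] a, int wgsize) {
            int[] local = new int[wgsize];
            for (int lid = 0; lid < wgsize; lid++) {            // steps 1 + 2
                int s = 0;
                for (int j = lid; j < a.length; j += wgsize) s += a[j];
                local[lid] = s;
            }
            int steps = 32 - Integer.numberOfLeadingZeros(wgsize); // like clz()
            for (int i = 1; i < steps; i++) {                   // step 3
                for (int lid = 0; lid < wgsize; lid++) {
                    if (lid % (1 << i) == 0) {
                        local[lid] += local[lid + (1 << (i - 1))];
                    }
                }
            }
            return local[0];
        }
    }
    ```

    Running it on a few arrays and comparing against a plain loop confirms that local[0] ends up holding the full sum.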
    

    [Discussion]:
