【Posted】: 2014-10-09 04:49:02
【Problem description】:
So I'm having some trouble getting my code to run on certain OpenCL devices. I'm developing on a mid-2013 15" Retina MacBook Pro running OS X 10.9.5 (Mavericks), using Xcode 6.0.1.
After using clGetDeviceIDs to get all available devices and clGetDeviceInfo to inspect each one, I get the following:
Device: Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz
Hardware version: OpenCL 1.2
Software version: 1.1
OpenCL C version: OpenCL C 1.2
Parallel compute units: 8
Device: HD Graphics 4000
Hardware version: OpenCL 1.2
Software version: 1.2(Aug 17 2014 20:29:07)
OpenCL C version: OpenCL C 1.2
Parallel compute units: 16
Device: GeForce GT 650M
Hardware version: OpenCL 1.2
Software version: 8.26.28 310.40.55b01
OpenCL C version: OpenCL C 1.2
Parallel compute units: 2
So according to this I should have 1 CPU and 2 GPUs available: an HD Graphics 4000 and a GeForce GT 650M.
My problem is that when I call clGetKernelWorkGroupInfo with the device ID of either GPU, it returns a CL_INVALID_DEVICE error; but if I pass in the CPU's ID it works perfectly and computes my kernel code without any problem.
This is strange, because every other call before that point works for all 3 devices. I can create a single context containing all 3 devices, create 3 separate command queues (one per device), and I can build a program and create the kernel just fine. But as soon as I make that call, it says my device is invalid.
If I comment out the call to clGetKernelWorkGroupInfo and specify my own global/local work sizes, I instead get a CL_INVALID_PROGRAM_EXECUTABLE error when I call clEnqueueNDRangeKernel.
Is there something wrong with the graphics cards installed in my machine, or is there something on the code side I'm not aware of? I just don't understand how a device can be valid right up until that call and then suddenly become invalid.
EDIT: Here is my code (CheckError is just a function I wrote that prints a custom error message if an error occurred):
cl_int err;                      //Error catcher
cl_platform_id platform;         //Computer platform
cl_context context;              //Single context for whole platform
cl_uint deviceCount;             //Number of devices (CPU + GPU) available on machine
cl_device_id *devices;           //Array of pointers to devices
cl_program program;              //OpenCL program
cl_command_queue *commandQueues; //One command queue for each device

/*---Definitions---*/
int DATA_SIZE = 16384;
double results[DATA_SIZE];       //Results returned from device
int currDevice = 0;              //Use this to just access the first available device

/*---Get first platform---*/
err = clGetPlatformIDs(1, &platform, NULL);
CheckError(err, "A valid platform could not be found on this machine");

/*---Get device count---*/
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
CheckError(err, "Could not determine the number of devices available on this platform");

/*---Get all devices---*/
devices = new cl_device_id[deviceCount];
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, deviceCount, devices, NULL);
CheckError(err, "Could not access the devices");

/*---Create a single context for all devices---*/
context = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
CheckError(err, "Could not create a context on this platform");

/*---For each device create a separate command queue---*/
commandQueues = new cl_command_queue[deviceCount];
for (int i = 0; i < deviceCount; i++)
{
    commandQueues[i] = clCreateCommandQueue(context, devices[i], 0, &err);
    string errMsg = "Was unable to successfully set up a command queue for device number " + to_string(i);
    CheckError(err, errMsg);
}

/*---Read in .cl file---*/
char *KernelSource = ReadFile("./Source/Sampling/Sampler.cl");

// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1, (const char **)&KernelSource, NULL, &err);
CheckError(err, "Failed to create compute program!");

// Build the program executable
err = clBuildProgram(program, deviceCount, devices, NULL, NULL, NULL);
if (err != CL_SUCCESS)
{
    size_t len;
    char buffer[2048];
    printf("Error: Failed to build program executable!\n");
    clGetProgramBuildInfo(program, devices[currDevice], CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
    printf("%s\n", buffer);
    exit(1);
}

// Create the compute kernel in the program we wish to run
cl_kernel kernel = clCreateKernel(program, "mySampler", &err);
CheckError(err, "Failed to create compute kernel!");

// Create the input array in device memory for our calculation
cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(double) * DATA_SIZE, NULL, &err);
CheckError(err, "Failed to allocate device memory");

// Set the arguments to our compute kernel
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
CheckError(err, "Failed to set kernel arguments");

size_t global, local;

// Get the maximum work group size for executing the kernel on the device
err = clGetKernelWorkGroupInfo(kernel, devices[currDevice], CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
CheckError(err, "Failed to retrieve work group info!");

// Execute the kernel over the entire range of our 1d input data set
// using the maximum number of work group items for this device
global = DATA_SIZE;
err = clEnqueueNDRangeKernel(commandQueues[currDevice], kernel, 1, NULL, &global, &local, 0, NULL, NULL);
CheckError(err, "Failed to execute kernel!");

// Wait for the commands to be serviced before reading back results
clFinish(commandQueues[currDevice]);

// Read back the results from the device to verify the output
err = clEnqueueReadBuffer(commandQueues[currDevice], input, CL_TRUE, 0, sizeof(double) * DATA_SIZE, results, 0, NULL, NULL);
CheckError(err, "Failed to read array");

std::cout << "DONE!" << std::endl;
for (int i = 0; i < DATA_SIZE; i++)
{
    std::cout << "RESULT: " << i << " " << results[i] << std::endl;
}

// Shutdown and cleanup
clReleaseMemObject(input);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(commandQueues[currDevice]);
clReleaseContext(context);
}
【Comments】:
-
It sounds like you've built the kernel for the CPU and then tried to use it on a GPU. Can you show us the host code where you select the platform, build the program, and then perform these queries?
-
OK, I'll update my post in a few seconds.
Tags: xcode macos opencl gpu cpu