I'm working on an OpenCL program that calls the same kernel several times in a loop. When I use clEnqueueReadBuffer to transfer device memory back to the host it reports that the command queue is invalid.
Below is a function that is called to initiate a bitonic sort, it's shortened to make it more readable. The device list, context, command queue and kernel are created outside and passed to this function. list contains the list to be sorted, size is the number of elements in list.
cl_int OpenCLBitonicSort(cl_device_id device, cl_context context,
cl_command_queue commandQueue, cl_kernel bitonicSortKernel,
unsigned int * list, unsigned int size){
//create OpenCL specific variables
cl_int error = CL_SUCCESS;
size_t maximum_local_ws;
size_t local_ws;
size_t global_ws;
//create variables that keep track of bitonic sorting progress
unsigned int stage = 0;
unsigned int subStage;
unsigned int numberOfStages = 0;
//get maximum work group size
clGetKernelWorkGroupInfo(bitonicSortKernel, device,
CL_KERNEL_WORK_GROUP_SIZE, sizeof(maximum_local_ws),
&maximum_local_ws, NULL);
//make local_ws the largest perfect square allowed by OpenCL
for(i = 1; i <= maximum_local_ws; i *= 2){
local_ws = (size_t) i;
}
//total number of comparators will be half the items in the list
global_ws = (size_t) size/2;
//transfer list to the device
cl_mem list_d = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR,
size * sizeof(unsigned int), list, &error);
//find the number of stages needed (numberOfStages = ln(size))
for(numberOfStages = 0; (1 << numberOfStages ^ size); numberOfStages++){
}
//loop through all stages
for(stage = 0; stage < numberOfStages; stage++){
//loop through all substages in each stage
for(subStage = stage, i = 0; i <= stage; subStage--, i++){
//add kernel parameters
error = clSetKernelArg(bitonicSortKernel, 0,
sizeof(cl_mem), &list_d);
error = clSetKernelArg(bitonicSortKernel, 1,
sizeof(unsigned int), &size);
error = clSetKernelArg(bitonicSortKernel, 2,
sizeof(unsigned int), &stage);
error = clSetKernelArg(bitonicSortKernel, 3,
sizeof(unsigned int), &subStage);
//call the kernel
error = clEnqueueNDRangeKernel(commandQueue, bitonicSortKernel, 1,
NULL, &global_ws, &local_ws, 0, NULL, NULL);
//wait for the kernel to stop executing
error = clEnqueueBarrier(commandQueue);
}
}
//read the result back to the host
error = clEnqueueReadBuffer(commandQueue, list_d, CL_TRUE, 0,
size * sizeof(unsigned int), list, 0, NULL, NULL);
//free the list on the device
clReleaseMemObject(list_d);
return error;
}
In this code: clEnqueueReadBuffer says that the commandQueue is invalid. However it was valid when I called clEnqueueNDRangeKernel and clEnqueueBarrier.
When I set numberOfStages to just be 1 and stage to just be 0, so that clEnqueueNDRangeKernel is only called once, the code worked without returning errors (although the result was not correct). There is a problem with calling clEnqueueNDRangeKernel more than once (which I really need to do).
I'm on Mac OS 10.6 Snow Leopard and I'm using Apple's OpenCL 1.0 platform with an NVidia GeForce 9600m. Is running kernels within loops possible in OpenCL on other platforms? Has anyone had problems like this with OpenCL on OS X? What could be causing the command queue to be going invalid?