Hi guys,
I'm developing a sequence comparison application with OpenCL 1.2 and testing it on an AMD R9 280X GPU. Here is the video card information:
DRIVER_VERSION: 1445.5 (VM)
Type: GPU
EXECUTION_CAPABILITIES: Kernel
GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
SINGLE_FP_CONFIG: 0xbe
QUEUE_PROPERTIES: 0x2
VENDOR_ID: 4098
MAX_COMPUTE_UNITS: 32
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 256
PREFERRED_VECTOR_WIDTH_CHAR: 4
PREFERRED_VECTOR_WIDTH_SHORT: 2
PREFERRED_VECTOR_WIDTH_INT: 1
PREFERRED_VECTOR_WIDTH_LONG: 1
PREFERRED_VECTOR_WIDTH_FLOAT: 1
PREFERRED_VECTOR_WIDTH_DOUBLE: 1
MAX_CLOCK_FREQUENCY: 1020
ADDRESS_BITS: 32
MAX_MEM_ALLOC_SIZE: 1073741824
IMAGE_SUPPORT: 1
MAX_READ_IMAGE_ARGS: 128
MAX_WRITE_IMAGE_ARGS: 8
IMAGE2D_MAX_WIDTH: 16384
IMAGE2D_MAX_HEIGHT: 16384
IMAGE3D_MAX_WIDTH: 2048
IMAGE3D_MAX_HEIGHT: 2048
IMAGE3D_MAX_DEPTH: 2048
MAX_SAMPLERS: 16
MAX_PARAMETER_SIZE: 1024
MEM_BASE_ADDR_ALIGN: 2048
MIN_DATA_TYPE_ALIGN_SIZE: 128
GLOBAL_MEM_CACHELINE_SIZE: 64
GLOBAL_MEM_CACHE_SIZE: 16384
GLOBAL_MEM_SIZE: 2893021184
MAX_CONSTANT_BUFFER_SIZE: 65536
MAX_CONSTANT_ARGS: 8
LOCAL_MEM_SIZE: 32768
ERROR_CORRECTION_SUPPORT: 0
PROFILING_TIMER_RESOLUTION: 1
ENDIAN_LITTLE: 1
AVAILABLE: 1
COMPILER_AVAILABLE: 1
MAX_WORK_ITEM_SIZES: 256 256 256
The program is correct and produces the right results (it also runs on other CPUs and GPUs), but the performance is very poor. It seems to me that OpenCL is not using all the available cores: the same code runs about 50× faster on an Nvidia GTX 680 card.
The code is a little complex, so I'm posting only the host code, since the OpenCL kernel itself executes correctly.
/* Set the kernel arguments; OR the return codes so one check covers all calls. */
err  = clSetKernelArg(kernel2,  0, sizeof(i0),       &i0);
err |= clSetKernelArg(kernel2,  1, sizeof(i1),       &i1);
err |= clSetKernelArg(kernel2,  2, sizeof(step),     &step);
err |= clSetKernelArg(kernel2,  3, sizeof(cutBlock), &cutBlock);
err |= clSetKernelArg(kernel2,  4, sizeof(cl_mem), (void*) &op->d_blockResult);
err |= clSetKernelArg(kernel2,  5, sizeof(cl_mem), (void*) &op->d_busH);
err |= clSetKernelArg(kernel2,  6, sizeof(cl_mem), (void*) &op->d_extraH);
err |= clSetKernelArg(kernel2,  7, sizeof(cl_mem), (void*) &op->d_busV_h);
err |= clSetKernelArg(kernel2,  8, sizeof(cl_mem), (void*) &op->d_busV_e);
err |= clSetKernelArg(kernel2,  9, sizeof(cl_mem), (void*) &op->d_busV_o);
err |= clSetKernelArg(kernel2, 10, sizeof(cl_mem), (void*) &op->d_split_m);
err |= clSetKernelArg(kernel2, 11, sizeof(cl_mem), (void*) &op->t_seq0);
err |= clSetKernelArg(kernel2, 12, sizeof(cl_mem), (void*) &op->t_seq1);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clSetKernelArg failed: %d\n", err);
    exit(EXIT_FAILURE);  /* exit(0) would report success to the shell */
}

global = blocks * threads;
local = threads;

/* Query the largest work-group size this kernel supports on this device. */
err = clGetKernelWorkGroupInfo(kernel2, device, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(max_work_groups), &max_work_groups, NULL);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clGetKernelWorkGroupInfo failed: %d\n", err);
    exit(EXIT_FAILURE);
}

err = clEnqueueNDRangeKernel(commands, kernel2, 1, NULL, &global, &local,
                             0, NULL, &k_event);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", err);
    exit(EXIT_FAILURE);
}

err = clWaitForEvents(1, &k_event);
err = clReleaseEvent(k_event);
In a typical execution, global=4096 and local=64, so the work size is greater than the number of GPU cores.
Is there any reason for OpenCL not to use all the available cores? Could it be a driver bug?
Thanks in advance.
Marco