0 votes

Hi guys.

I'm developing a sequence comparison application on OpenCL 1.2, testing on an AMD R9 280X GPU. Here is the video card information:

DRIVER_VERSION: 1445.5 (VM)
Type: GPU
EXECUTION_CAPABILITIES: Kernel
GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
SINGLE_FP_CONFIG: 0xbe
QUEUE_PROPERTIES: 0x2
VENDOR_ID: 4098
MAX_COMPUTE_UNITS: 32
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 256
PREFERRED_VECTOR_WIDTH_CHAR: 4
PREFERRED_VECTOR_WIDTH_SHORT: 2
PREFERRED_VECTOR_WIDTH_INT: 1
PREFERRED_VECTOR_WIDTH_LONG: 1
PREFERRED_VECTOR_WIDTH_FLOAT: 1
PREFERRED_VECTOR_WIDTH_DOUBLE: 1
MAX_CLOCK_FREQUENCY: 1020
ADDRESS_BITS: 32
MAX_MEM_ALLOC_SIZE: 1073741824
IMAGE_SUPPORT: 1
MAX_READ_IMAGE_ARGS: 128
MAX_WRITE_IMAGE_ARGS: 8
IMAGE2D_MAX_WIDTH: 16384
IMAGE2D_MAX_HEIGHT: 16384
IMAGE3D_MAX_WIDTH: 2048
IMAGE3D_MAX_HEIGHT: 2048
IMAGE3D_MAX_DEPTH: 2048
MAX_SAMPLERS: 16
MAX_PARAMETER_SIZE: 1024
MEM_BASE_ADDR_ALIGN: 2048
MIN_DATA_TYPE_ALIGN_SIZE: 128
GLOBAL_MEM_CACHELINE_SIZE: 64
GLOBAL_MEM_CACHE_SIZE: 16384
GLOBAL_MEM_SIZE: 2893021184
MAX_CONSTANT_BUFFER_SIZE: 65536
MAX_CONSTANT_ARGS: 8
LOCAL_MEM_SIZE: 32768
ERROR_CORRECTION_SUPPORT: 0
PROFILING_TIMER_RESOLUTION: 1
ENDIAN_LITTLE: 1
AVAILABLE: 1
COMPILER_AVAILABLE: 1
MAX_WORK_GROUP_SIZES: 256 256 256

The program is correct and produces the right results (it also runs on other CPUs and GPUs), but the performance is very bad. It seems to me that OpenCL is not using all available cores. The same code runs 50X faster on an Nvidia GTX 680 card.

The code is a little complex, so I'm posting just the host code, since the OpenCL kernel code executes correctly.

    err = 0;
    err  = clSetKernelArg(kernel2, 0, sizeof(i0), &i0);
    err |= clSetKernelArg(kernel2, 1, sizeof(i1), &i1);
    err |= clSetKernelArg(kernel2, 2, sizeof(step), &step);
    err |= clSetKernelArg(kernel2, 3, sizeof(cutBlock), &cutBlock);
    err |= clSetKernelArg(kernel2, 4, sizeof(cl_mem), (void*) &op->d_blockResult);
    err |= clSetKernelArg(kernel2, 5, sizeof(cl_mem), (void*) &op->d_busH);
    err |= clSetKernelArg(kernel2, 6, sizeof(cl_mem), (void*) &op->d_extraH);
    err |= clSetKernelArg(kernel2, 7, sizeof(cl_mem), (void*) &op->d_busV_h);
    err |= clSetKernelArg(kernel2, 8, sizeof(cl_mem), (void*) &op->d_busV_e);
    err |= clSetKernelArg(kernel2, 9, sizeof(cl_mem), (void*) &op->d_busV_o);
    err |= clSetKernelArg(kernel2, 10, sizeof(cl_mem), (void*) &op->d_split_m);
    err |= clSetKernelArg(kernel2, 11, sizeof(cl_mem), (void*) &op->t_seq0);
    err |= clSetKernelArg(kernel2, 12, sizeof(cl_mem), (void*) &op->t_seq1);

    if (err != CL_SUCCESS)
        exit(1);

    global = blocks * threads;
    local = threads;

    /* Query the kernel's maximum work-group size on this device. */
    err = clGetKernelWorkGroupInfo(kernel2, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(max_work_groups), &max_work_groups, NULL);
    if (err != CL_SUCCESS)
        exit(1);

    err = clEnqueueNDRangeKernel(commands, kernel2, 1, NULL, &global, &local, 0, NULL, &k_event);
    if (err != CL_SUCCESS)
        exit(1);

    err = clWaitForEvents(1,&k_event);
    err = clReleaseEvent(k_event);

In a typical execution, global=4096 and local=64, so the work size is greater than the number of GPU cores.

Is there any reason for OpenCL not to use all available cores? Could it be a driver bug?

Thanks in advance.

Marco

How do you know it isn't running on all cores? – talonmies

So your real concern is that "the performance is very bad". Did you profile your kernel to see what is actually going on in there? – void_ptr

Hi, talonmies and void_ptr. First of all, thanks for your help. – Marcoacf

I couldn't profile the application yet, but here is some extra information. The application compares two huge sequences (more than 1M characters each). The kernels are compiled with THREADS=128. The code also runs on 8-core CPUs. On the AMD 280X GPU, the application (exactly the same code) is just 2X faster than the CPU execution. On two Nvidia GPU cards (GTX 580 and GTX 680) the application runs at least 50X faster. During the execution, the application prints the partial results: on the Nvidia GPU, more than 200 blocks of cells are processed on each step; on the AMD GPU, just 8 blocks. Any thoughts? – Marcoacf

Without seeing kernel code, and without profiling results, it is impossible to tell what's up with the performance. I suggest you add those details. You also mentioned printing partial results - if that happens from the kernel code, it'll kill your performance. – void_ptr

1 Answer

1 vote
 global=4096 and local=64, so the work size is greater than the number of GPU cores.

This is not enough. Your GPU has 2048 cores, but with global=4096 each core does only two work-items' worth of work, so the kernel launch overhead dwarfs the execution itself. You need a global size of at least 8192 or 16384, or even 1M, depending on your algorithm, to keep those cores busy while the memory units are busy too.

An Nvidia GPU can use local=1024 and finish the whole work in just 4 SMX blocks (increasing inter-thread communication capabilities). There is a real possibility that your card overtakes it once the work is large enough.

Are you keeping the buffer upload-to-GPU and download-from-GPU timings out of the performance measurements?

Any reason to OpenCL not to use all available cores? 

Since OpenCL doesn't report per-core usage, you can only say "it's ok" if your GPU heats up as much as the GTX 680 does, on the assumption that the heat sources are the cores' semiconductors.

The same code runs 50X faster on an Nvidia GTX 680 card.

The same code, but CUDA or OpenCL? With interop? Did you run all the tests yourself? Against a single CPU core, all cores, or all cores with SSE/AVX? Same driver versions? Same OS? Same environment?

What is your algorithm? If it's matrix multiplication, more inter-thread communication is needed, so use local=256 (Nvidia allows 1024). Are you using async copies? Does your code have any fake recursion?