0 votes

I am new to OpenCL and I want to parallelise the Sieve of Atkin. The C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/

Somehow I don't get good results out of it; after comparing, the CPU version is actually much faster. I tried to use enqueueNDRangeKernel to avoid writing the nested loops and hopefully increase performance, but when I pass a higher limit to the function, the GPU driver stops responding and the program crashes. Maybe my NDRange configuration is wrong; could anyone help with it? I probably don't understand NDRange properly. Here is the info about my GPU:

  • CL_DEVICE_NAME: GeForce GT 740M

  • CL_DEVICE_VENDOR: NVIDIA Corporation

  • CL_DRIVER_VERSION: 397.31

  • CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU

  • CL_DEVICE_MAX_COMPUTE_UNITS: 2

  • CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3

  • CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64

  • CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024

  • CL_DEVICE_MAX_CLOCK_FREQUENCY: 1032 MHz

  • CL_DEVICE_ADDRESS_BITS: 32

  • CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte

  • CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte

  • CL_DEVICE_ERROR_CORRECTION_SUPPORT: no

  • CL_DEVICE_LOCAL_MEM_TYPE: local

  • CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte

  • CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte

  • CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, CL_QUEUE_PROFILING_ENABLE

  • CL_DEVICE_IMAGE_SUPPORT: 1

  • CL_DEVICE_MAX_READ_IMAGE_ARGS: 256

  • CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16

    Here is my NDRange enqueue call:

    queue.enqueueNDRangeKernel(add, cl::NDRange(1,1), cl::NDRange((limit * limit) -1, (limit * limit) -1 ), cl::NullRange,NULL, &event);

and my kernel code:

__kernel void sieveofAktin(const int limit, __global bool* sieve)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    //printf("%d \n", x);

    int n = (4 * x * x) + (y * y);
    if (n <= limit && (n % 12 == 1 || n % 12 == 5))
        sieve[n] ^= true;

    n = (3 * x * x) + (y * y);
    if (n <= limit && n % 12 == 7)
        sieve[n] ^= true;

    n = (3 * x * x) - (y * y);
    if (x > y && n <= limit && n % 12 == 11)
        sieve[n] ^= true;

    for (int r = 5; r * r < limit; r++) {
        if (sieve[r]) {
            for (int i = r * r; i < limit; i += r * r)
                sieve[i] = false;
        }
    }
}
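For reference, the sequential CPU version I am benchmarking against is essentially the code from the linked GeeksforGeeks article, which (lightly trimmed) looks like this:

```cpp
#include <vector>

// Classic single-threaded Sieve of Atkin, adapted from the linked article.
// Returns all primes up to and including limit.
std::vector<int> sieveOfAtkin(int limit) {
    std::vector<bool> sieve(limit + 1, false);
    if (limit > 2) sieve[2] = true;
    if (limit > 3) sieve[3] = true;

    // Note: x and y only need to run while x*x <= limit, i.e. up to sqrt(limit).
    for (int x = 1; x * x <= limit; x++) {
        for (int y = 1; y * y <= limit; y++) {
            int n = (4 * x * x) + (y * y);
            if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                sieve[n] = !sieve[n];

            n = (3 * x * x) + (y * y);
            if (n <= limit && n % 12 == 7)
                sieve[n] = !sieve[n];

            n = (3 * x * x) - (y * y);
            if (x > y && n <= limit && n % 12 == 11)
                sieve[n] = !sieve[n];
        }
    }

    // Eliminate multiples of squares of primes.
    for (int r = 5; r * r <= limit; r++)
        if (sieve[r])
            for (int i = r * r; i <= limit; i += r * r)
                sieve[i] = false;

    std::vector<int> primes;
    for (int a = 2; a <= limit; a++)
        if (sieve[a]) primes.push_back(a);
    return primes;
}
```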

2 Answers

1 vote

You have a lot of branching in that code, and I suspect that may be what is killing your performance on the GPU. Look at chapter 6 of the NVIDIA OpenCL Best Practices Guide for details on why this hurts performance.

I'm not sure how feasible that is without looking closely at the algorithm, but ideally you want to rewrite the code to use as little branching as possible. Alternatively, you could look at other algorithms entirely.
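As a sketch of what that transformation can look like, here is the first of your three toggles rewritten in predicated form: compute the condition as a 0/1 value and fold it into both the index and the XOR, so every work-item executes the same instruction stream. (Plain C++ below so it's easy to test on the host; the same rewrite applies line-for-line in OpenCL C. Whether this actually beats what the compiler already generates is something only the profiler can tell you.)

```cpp
#include <vector>

// Branch-free form of:
//     if (n <= limit && (n % 12 == 1 || n % 12 == 5)) sieve[n] ^= true;
// "hit" is 0 or 1. XOR with 0 is a no-op, and the index collapses to slot 0
// when the candidate is out of range or fails the residue test, so the
// out-of-range case writes nothing meaningful instead of taking a branch.
void toggleCandidate(std::vector<char>& sieve, int limit, int x, int y) {
    int n = 4 * x * x + y * y;
    int hit = (n <= limit) & ((n % 12 == 1) | (n % 12 == 5));
    sieve[hit ? n : 0] ^= static_cast<char>(hit);  // select, not a divergent branch
}
```

The ternary compiles to a select/predicated move rather than a divergent jump, which is the property chapter 6 of the Best Practices Guide is after.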

As for the lock-up, I'd need to see more of your host code to know what is happening, but it's possible you're exceeding various limits of your platform/device. Are you checking the error code returned by every OpenCL function you call?
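A minimal pattern for that looks something like the sketch below. To keep it self-contained here, the `CL_SUCCESS` constant is a stand-in for the one in `CL/cl.h`, and in real host code `err` comes from the return value (or `&err` out-parameter) of each `cl*` / `cl::` call:

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for the value defined in CL/cl.h; real code includes that header.
constexpr int CL_SUCCESS = 0;

// Fails loudly with the call site and error code, so no failing OpenCL call
// (oversized NDRange, failed allocation, ...) goes unnoticed.
void checkCl(int err, const std::string& where) {
    if (err != CL_SUCCESS) {
        std::ostringstream msg;
        msg << where << " failed with OpenCL error " << err;
        throw std::runtime_error(msg.str());
    }
}
```

With the C++ bindings you'd use it as `checkCl(queue.enqueueNDRangeKernel(...), "enqueueNDRangeKernel");` (assuming you haven't built the bindings with exceptions enabled). Seeing something like `CL_INVALID_WORK_GROUP_SIZE` or `CL_OUT_OF_RESOURCES` come back here would tell you quickly whether your global work size is the culprit.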

1 vote

Regardless of how good or bad your algorithm or implementation is, the driver should always respond. A non-responsive driver is quite possibly a driver bug, so file a bug report at http://developer.nvidia.com/ .