0 votes

I am new to OpenCL and I want to parallelise the Sieve of Atkin. The C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/

Somehow I don't get good results out of it; after comparing, the CPU version is actually much faster. I tried to use enqueueNDRangeKernel to avoid writing the nested loops and hopefully increase performance, but when I pass a higher limit to the function, the GPU driver stops responding and the program crashes. Maybe my NDRange configuration is wrong; could anyone help with it? I probably don't understand NDRange properly. Here is the info about my GPU:

  • CL_DEVICE_NAME: GeForce GT 740M

  • CL_DEVICE_VENDOR: NVIDIA Corporation

  • CL_DRIVER_VERSION: 397.31

  • CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU

  • CL_DEVICE_MAX_COMPUTE_UNITS: 2

  • CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3

  • CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64

  • CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024

  • CL_DEVICE_MAX_CLOCK_FREQUENCY: 1032 MHz

  • CL_DEVICE_ADDRESS_BITS: 32

  • CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte

  • CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte

  • CL_DEVICE_ERROR_CORRECTION_SUPPORT: no

  • CL_DEVICE_LOCAL_MEM_TYPE: local

  • CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte

  • CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte

  • CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, CL_QUEUE_PROFILING_ENABLE

  • CL_DEVICE_IMAGE_SUPPORT: 1

  • CL_DEVICE_MAX_READ_IMAGE_ARGS: 256

  • CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16

    Here is my NDRange enqueue call:

    queue.enqueueNDRangeKernel(add, cl::NDRange(1,1), cl::NDRange((limit * limit) -1, (limit * limit) -1 ), cl::NullRange,NULL, &event);

and my kernel code:

__kernel void sieveofAktin(const int limit, __global bool* sieve)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    //printf("%d \n", x);

    int n = (4 * x * x) + (y * y);
    if (n <= limit && (n % 12 == 1 || n % 12 == 5))
        sieve[n] ^= true;

    n = (3 * x * x) + (y * y);
    if (n <= limit && n % 12 == 7)
        sieve[n] ^= true;

    n = (3 * x * x) - (y * y);
    if (x > y && n <= limit && n % 12 == 11)
        sieve[n] ^= true;

    for (int r = 5; r * r < limit; r++) {
        if (sieve[r]) {
            for (int i = r * r; i < limit; i += r * r)
                sieve[i] = false;
        }
    }
}
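For reference, the sequential CPU version I am benchmarking against is essentially the code from the linked GeeksforGeeks article, which (lightly trimmed) looks like this:

```cpp
#include <vector>

// Classic single-threaded Sieve of Atkin, adapted from the linked article.
// Returns all primes up to and including limit.
std::vector<int> sieveOfAtkin(int limit) {
    std::vector<bool> sieve(limit + 1, false);
    if (limit > 2) sieve[2] = true;
    if (limit > 3) sieve[3] = true;

    // Note: x and y only need to run while x*x <= limit, i.e. up to sqrt(limit).
    for (int x = 1; x * x <= limit; x++) {
        for (int y = 1; y * y <= limit; y++) {
            int n = (4 * x * x) + (y * y);
            if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                sieve[n] = !sieve[n];

            n = (3 * x * x) + (y * y);
            if (n <= limit && n % 12 == 7)
                sieve[n] = !sieve[n];

            n = (3 * x * x) - (y * y);
            if (x > y && n <= limit && n % 12 == 11)
                sieve[n] = !sieve[n];
        }
    }

    // Eliminate multiples of squares of primes.
    for (int r = 5; r * r <= limit; r++)
        if (sieve[r])
            for (int i = r * r; i <= limit; i += r * r)
                sieve[i] = false;

    std::vector<int> primes;
    for (int a = 2; a <= limit; a++)
        if (sieve[a]) primes.push_back(a);
    return primes;
}
```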

2 Answers

1 vote

You have a lot of branching in that code, and I suspect that may be what is killing your performance on the GPU. Look at chapter 6 of the NVIDIA OpenCL Best Practices Guide for details on why this hurts performance.

I'm not sure how feasible that is without looking closely at the algorithm, but ideally you want to rewrite the code to use as little branching as possible. Alternatively, you could look at other algorithms entirely.
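As a sketch of what that transformation can look like, here is the first of your three toggles rewritten in predicated form: compute the condition as a 0/1 value and fold it into both the index and the XOR, so every work-item executes the same instruction stream. (Plain C++ below so it's easy to test on the host; the same rewrite applies line-for-line in OpenCL C. Whether this actually beats what the compiler already generates is something only the profiler can tell you.)

```cpp
#include <vector>

// Branch-free form of:
//     if (n <= limit && (n % 12 == 1 || n % 12 == 5)) sieve[n] ^= true;
// "hit" is 0 or 1. XOR with 0 is a no-op, and the index collapses to slot 0
// when the candidate is out of range or fails the residue test, so the
// out-of-range case writes nothing meaningful instead of taking a branch.
void toggleCandidate(std::vector<char>& sieve, int limit, int x, int y) {
    int n = 4 * x * x + y * y;
    int hit = (n <= limit) & ((n % 12 == 1) | (n % 12 == 5));
    sieve[hit ? n : 0] ^= static_cast<char>(hit);  // select, not a divergent branch
}
```

The ternary compiles to a select/predicated move rather than a divergent jump, which is the property chapter 6 of the Best Practices Guide is after.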

As for the lock-up, I'd need to see more of your host code to know what is happening, but it's possible you're exceeding various limits of your platform/device. Are you checking the error code returned by every OpenCL function you call?
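A minimal pattern for that looks something like the sketch below. To keep it self-contained here, the `CL_SUCCESS` constant is a stand-in for the one in `CL/cl.h`, and in real host code `err` comes from the return value (or `&err` out-parameter) of each `cl*` / `cl::` call:

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for the value defined in CL/cl.h; real code includes that header.
constexpr int CL_SUCCESS = 0;

// Fails loudly with the call site and error code, so no failing OpenCL call
// (oversized NDRange, failed allocation, ...) goes unnoticed.
void checkCl(int err, const std::string& where) {
    if (err != CL_SUCCESS) {
        std::ostringstream msg;
        msg << where << " failed with OpenCL error " << err;
        throw std::runtime_error(msg.str());
    }
}
```

With the C++ bindings you'd use it as `checkCl(queue.enqueueNDRangeKernel(...), "enqueueNDRangeKernel");` (assuming you haven't built the bindings with exceptions enabled). Seeing something like `CL_INVALID_WORK_GROUP_SIZE` or `CL_OUT_OF_RESOURCES` come back here would tell you quickly whether your global work size is the culprit.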

1 vote

Regardless of how good or bad your algorithm or implementation is, the driver should always respond. A non-responsive driver is quite possibly a driver bug, so file a bug report at http://developer.nvidia.com/ .