OpenCL: the same algorithms on GPU and CPU, but OpenCL works differently on the two devices

Question

I have written two programs from scratch: one computes an integral and the other does matrix-matrix multiplication. When I ran both programs on GPU cards with the global size set to 1024, I expected the kernel code to execute 1024 times, and it did: the kernel ran exactly as many times as the global size, and changing the local size had no effect on the results or the output. I then ran the same code on a CPU and was shocked to see that the kernel function does not execute as many times as the global size specifies. Here is an example from the integral program: with global size = 2048 and local size = 1, I expect 2048 executions of the kernel function, and I do get 2048, but with global size = 2048 and local size = 16 it executes 256 times... Is this normal? Why does OpenCL behave differently on a CPU than on a GPU? I thought that, from the user's side, it should not matter which device we use and that the same code should work the same way on different devices. Am I wrong?

Thank you in advance for help!

We need a Minimal, Compilable, Verifiable Example of the problem. So please post some code and how you're enqueuing it (and with which OpenCL driver etc.). In general, the driver is supposed to ensure the entire grid is executed, i.e. local work size times global work size threads executing overall. — einpoklum
There must be a wrong interpretation of the global thread id, the group id and the local thread id. How are you checking the number of executions? — huseyin tugrul buyukisik
@huseyintugrulbuyukisik I simply add += 1 to a global variable in the kernel code to see how many times the kernel was executed — Gzyniu
@Gzyniu is it atomic or a simple one? I mean, did you use atomic_add() or really += ? — huseyin tugrul buyukisik
@huseyintugrulbuyukisik a simple one, but I also tried atomic_add and got the same results — Gzyniu

huseyin tugrul buyukisik · Accepted Answer · 2017-01-06T23:23:02

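Use atomic operations for serial work (or at least for work that is not easily reducible). To count how many threads participated, don't use a[0] += 1; a plain += is a non-atomic read-modify-write, so concurrent work-items can overwrite each other's updates. Instead,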

atomic_add(&a[0],1);

should work, or even better:

atomic_inc(a);

where a points to an integer; whether it is signed or unsigned doesn't matter.
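
Why a plain += can undercount can be illustrated with a small model in plain C. This is only a sketch under stated assumptions: it is not OpenCL code and does not describe how any particular driver schedules work-items. It assumes a hypothetical group of `lanes` work-items running in lockstep, so that all of them read the counter before any of them writes it back. The function names and the `lanes` parameter are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical model (an assumption, not real OpenCL behaviour):
 * `lanes` work-items execute `a[0] += 1` in lockstep, so every lane
 * reads the same old value and writes back old + 1. */
int count_with_plain_add(size_t global_size, size_t lanes)
{
    int counter = 0;
    if (lanes == 0) {
        return 0; /* guard against an endless loop */
    }
    for (size_t base = 0; base < global_size; base += lanes) {
        size_t remaining = global_size - base;
        size_t active = remaining < lanes ? remaining : lanes;
        int seen = counter; /* all lanes read before any lane writes */
        for (size_t lane = 0; lane < active; ++lane) {
            counter = seen + 1; /* each lane overwrites the same result */
        }
    }
    return counter;
}

/* With an atomic increment, each read-modify-write is indivisible,
 * so every work-item adds exactly one to the counter. */
int count_with_atomic_add(size_t global_size)
{
    int counter = 0;
    for (size_t item = 0; item < global_size; ++item) {
        counter += 1; /* stands in for atomic_inc(a) in this serial model */
    }
    return counter;
}
```

In this model, count_with_plain_add(2048, 1) returns 2048, larger values of `lanes` return fewer, and count_with_atomic_add(2048) always returns 2048. The counts a real device reports depend on how its driver maps work-items to hardware.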
