I have two OpenCL programs that I wrote from scratch: one computes an integral, the other does matrix-matrix multiplication. When I ran both on GPU cards with the global size set to 1024, I expected the kernel to execute 1024 times, and it did. The number of executions always matched the global size, and changing the local size had no effect on the results or the output.

Then I ran the same code on a CPU, and I was shocked to see that the kernel does not execute as many times as the global size says it should. Here is an example from the integral program: with global size = 2048 and local size = 1, I expect 2048 executions of the kernel and I get 2048. But with global size = 2048 and local size = 16, it executes only 256 times.

Is this normal? Why does OpenCL behave differently on a CPU than on a GPU? I thought the device should not matter from the user's side, and that the same code should behave the same way on different devices. Am I wrong?
Thank you in advance for your help!