I was reading some results in which 5120 work-groups and a local size of 1 were used. I have limited knowledge of OpenCL, and I was wondering whether this statement is correct:
As can be seen for the GPU, the first test has 5120 work-groups with 1 work-item each. This means that the number of threads executed in parallel is limited to the number of compute units in the machine. For example, if a GPU has 20 compute units, at most 20 threads can work in parallel. When the local size is increased to 2, however, twice as many threads run simultaneously.
From what I have read about OpenCL, it seems about right, but I would like a second opinion.
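To make the numbers concrete, here is a small Python sketch of how I read the statement. The function names are my own and purely illustrative, not part of any OpenCL API. The first two functions use the usual relation global size = work-groups × local size, assuming the global size is a multiple of the local size. The last function encodes only the model claimed in the statement (one work-group per compute unit at a time), which is exactly the part I am unsure about.

```python
def global_size_for(num_groups: int, local_size: int) -> int:
    """Total number of work-items for a given number of work-groups."""
    return num_groups * local_size


def num_work_groups(global_size: int, local_size: int) -> int:
    """Number of work-groups, assuming global_size is a multiple of local_size."""
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global_size must be a positive multiple of local_size")
    return global_size // local_size


def claimed_parallel_work_items(compute_units: int, local_size: int) -> int:
    """Parallel work-items ACCORDING TO THE QUOTED STATEMENT (unverified).

    Assumption taken from the statement: each compute unit runs exactly one
    work-group at a time, and every work-item in that group runs in parallel.
    Real hardware may behave differently.
    """
    return compute_units * local_size


if __name__ == "__main__":
    # The first test from the results: 5120 work-groups, local size 1.
    print(global_size_for(5120, 1))
    # The example from the statement: a GPU with 20 compute units.
    print(claimed_parallel_work_items(20, 1))
    print(claimed_parallel_work_items(20, 2))
```

If the statement holds, the last two calls should differ by a factor of two, which is the doubling it describes.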