The
while( j != i )
len += lenghtIn[ j++ ];
part runs for get_global_id( 0 ) times.
When it is 128, the latest work item to complete is doing 128 loop iterations.
When it is 256, it is doing 256 iterations so it should be %100 increase from memory's point of view but only for the last work item. When we integrate all workers' total memory access numbers,
1 item from 0 to 0 ---> 1 access
2 item from 0 to 0 and 0 to 1 ---> 3 access
4 item from 0 to 0 and 0 to 1 and 0 to 2 and 0 to 3---> 10 access
8 items: SUM(1 to 8) => 36 accesses
16 items: SUM(1 to 16) => 136 accesses (even more than + %200)
32 items: => 528 (~ %400)
64 items: => 2080 ( ~%400)
128 items: => 8256 (~%400) (cache of your igpu starts failing here)
256 items: => 32896 (~400%) (now caching is saturated and you start )
( seeing %400 per doubling of work items)
512 => uses second compute unit too! But %400 work is done
so it is not only %200 time consuming.
so each time you increase work items by %100, you increase total memory
accesses to %400 . But caching helps up to some degree. When you cross that, memory accesses increase badly. Alse the execution overhead(drivers,..) becomes unimportant.
You are accessing to memory non-parallel. You need to cache it first but it may not be possible in that hardware so you should distribute the job equally among workitems and make memory accesses contiguous between cores(vectorize). This should give more performance.
For now, each vector unit does:
unit : v0 v1 v2 v3 v4 ... v7
read address: 0 0 0 0 0 0
- 1 1 1 1 1
- - 2 2 2 2
- - - 3 3 3
- - - - 4 4
....
- - - - - ... 7
done in 8 steps on 8 streaming cores.
At the last step, only single work item is actually computing something. This should be something like:
Some Optimization
unit : v0 v1 v2 v3 no need other work items
read address: 0 0 0 0 \
1 1 1 1 \
2 2 2 2 \
3 3 3 3 / this is 5th work item's work
4 4 4 4 /
5 5 5 0 \
6 6 0 1 \ this is 0 to 3 as 4th work
7 0 1 2 /
first item<-- 0 1 2 3 /
done in 8 steps in only 4 streaming cores and is doing same job for the first
half part(probably faster).
Further Optimization Suggestion
I think it would be better with a prefix-scan(sum) algorithm on another kernel before getting to crc32() part. (probably in just 3 steps for this example rather than 8 and also more efficient)
Using precomputed values of
while( j != i )
len += lenghtIn[ j++ ];
should make crc32 immune to the current algorithm complexity (O(n²)).