Opencl - reduction in the rate of the number of work-item

Question

I'm trying calculate crc32 for multithread. I'm trying use OpenCL. The GPU code is:

__kernel void crc32_Sarwate( __global int* lenghtIn, 
                         __global unsigned char *In, 
                         __global int *OutCrc32,
                            int size ) {
int i, j, len;

i = get_global_id( 0 ); 
if( i >= size )
    return;
len = j = 0;
while( j != i )
    len += lenghtIn[ j++ ];
OutCrc32[ i ] = crc32( In + len, lenghtIn[ i ] ); }

I received this results( time ) with a thousand repetitions:
for 4 using work-item: 29.82
for 8 using work-item: 29.9
for 16 using work-item: 35.16
for 32 using work-item: 35.93
for 64 using work-item: 38.69
for 128 using work-item: 52.83
for 256 using work-item: 152.08
for 512 using work-item: 333.63

I have intel HD Graphics with 350 MHz and 3 work-group with 256 work-item each work-group. I assumed that by increasing the number of work-item 128 to 256 happen slight increase in time, but time tripled. Why? ( I'm sorry for my very bad English ).

huseyin tugrul buyukisik huseyin tugrul buyukisik · Accepted Answer · 2015-01-31T14:18:58

The

while( j != i )
    len += lenghtIn[ j++ ];

part runs for get_global_id( 0 ) times.

When it is 128, the latest work item to complete is doing 128 loop iterations.

When it is 256, it is doing 256 iterations so it should be %100 increase from memory's point of view but only for the last work item. When we integrate all workers' total memory access numbers,

 1 item from 0 to 0 ---> 1 access
 2 item from 0 to 0 and 0 to 1  ---> 3 access
 4 item from 0 to 0 and 0 to 1 and 0 to 2 and 0 to 3---> 10 access
 8 items: SUM(1 to 8) => 36 accesses  
 16 items: SUM(1 to 16) => 136 accesses (even more than + %200)
 32 items: => 528 (~ %400)
 64 items: => 2080 ( ~%400)
 128 items: => 8256 (~%400) (cache of your igpu starts failing here)
 256 items: => 32896 (~400%) (now caching is saturated and you start )
                              ( seeing %400 per doubling of work items)

 512 => uses second compute unit too! But %400 work is done 
  so it is not only %200 time consuming.

so each time you increase work items by %100, you increase total memory accesses to %400 . But caching helps up to some degree. When you cross that, memory accesses increase badly. Alse the execution overhead(drivers,..) becomes unimportant.

You are accessing to memory non-parallel. You need to cache it first but it may not be possible in that hardware so you should distribute the job equally among workitems and make memory accesses contiguous between cores(vectorize). This should give more performance.

For now, each vector unit does:

unit        :   v0 v1 v2 v3 v4 ... v7
read address:   0  0  0  0  0      0
                -  1  1  1  1      1
                -  -  2  2  2      2
                -  -  -  3  3      3
                -  -  -  -  4      4
                     ....  
                -  -  -  -  - ...  7

done in 8 steps on 8 streaming cores.

At the last step, only single work item is actually computing something. This should be something like:

Some Optimization

unit        :   v0 v1 v2 v3  no need other work items
read address:   0  0  0  0  \
                1  1  1  1   \
                2  2  2  2    \
                3  3  3  3    / this is 5th work item's work
                4  4  4  4   /
                5  5  5  0   \
                6  6  0  1    \ this is 0 to 3 as 4th work
                7  0  1  2    /
 first item<--  0  1  2  3   /

done in 8 steps in only 4 streaming cores and is doing same job for the first half part(probably faster).

Further Optimization Suggestion

I think it would be better with a prefix-scan(sum) algorithm on another kernel before getting to crc32() part. (probably in just 3 steps for this example rather than 8 and also more efficient)

Using precomputed values of

while( j != i )
        len += lenghtIn[ j++ ];

should make crc32 immune to the current algorithm complexity (O(n²)).

Opencl - reduction in the rate of the number of work-item

1 Answers