
I am implementing an algorithm on a GPU using OpenCL.

Currently I am launching the kernel with only one work-group containing 128 work-items. The data in global memory is used many times by every work-item. To take advantage of the speed of local (shared) memory, I copied it to local memory using the following code:

__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      __const int N){

     size_t itemId = get_local_id(0);
     localInput[itemId] = input[itemId];
     barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

     /* Remaining algo here. */
}

The above code works well if there is only one work-group. But with more than one work-group, say two work-groups with an equal number of items in each, the kernel copies only the first half of the input into the first work-group's local memory and the second half into the second's.

I also tried the following kernel:

__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      __const int N){

     size_t itemId = get_local_id(0);
     if (itemId == 0) {
         for (int index = 0; index < N; index++) {
             localInput[index] = input[index];
         }
     }
     barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

     /* Remaining algo here. */
}

But the above code has its own problems: only one work-item per group does the copying while the rest sit idle, and the conditional causes divergence, which decreases performance.

What further modifications can be made to the code so that the entire array is copied to the local memory of each work-group efficiently?

Any suggestions are much appreciated.


1 Answer


Depending on what device you're running on, there's a good chance you can completely ignore local memory. Desktop GPUs used to have practically no cache whatsoever, which made using local memory very important, but these days they have a decent amount. If you're hitting the same portion of global memory repeatedly, it will all sit in cache, which is generally the same size as local memory and just as fast (on many GPUs they're the same block of memory, just split). Manually copying to local memory might additionally impose a minor performance penalty.

If you aren't on a desktop GPU (ARM, etc.), or your requirements make this impractical, async_work_group_copy might be what you are looking for.
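
A minimal sketch of that approach, keeping the signature of the kernels above and assuming all N float2 elements fit in the local buffer:

__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      __const int N){

     // All work-items in the group must reach this call with the same
     // arguments; the copy is then performed cooperatively by the group.
     event_t evt = async_work_group_copy(localInput, input, N, 0);

     // Block until the copy has completed before reading localInput.
     wait_group_events(1, &evt);

     /* Remaining algo here. */
}

Every work-item issues the copy with identical arguments, so there is no divergence, and wait_group_events guarantees the data is in place before it is read.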

On an unrelated note, the above code only needs barrier(CLK_LOCAL_MEM_FENCE), as you presumably aren't modifying your input.
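
For illustration, the copy in your first kernel would then look like this:

     localInput[itemId] = input[itemId];
     barrier(CLK_LOCAL_MEM_FENCE); // input is only read, so no global fence is needed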