I am implementing an algorithm on a GPU using OpenCL.
Currently I launch the kernel with a single work-group containing 128 work-items. Every work-item reads the data in global memory many times. To take advantage of the speed of local (shared) memory, I copy the data into shared memory using the following code:
    __kernel void kernel1(__global float2* input,
                          __global int* bin,
                          __global float2* DFT,
                          __local float2* localInput,
                          __const int N){
        size_t itemId = get_local_id(0);
        localInput[itemId] = input[itemId];
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
        /* ... Remaining algo here. ... */
    }
The above code works well if there is only one work-group. But with more than one work-group, say two work-groups with an equal number of work-items each, the kernel copies only half of the array into each work-group's shared memory.
I also tried the kernel below:
    __kernel void kernel1(__global float2* input,
                          __global int* bin,
                          __global float2* DFT,
                          __local float2* localInput,
                          __const int N){
        size_t itemId = get_local_id(0);
        if(itemId == 0){
            for(int index = 0; index < N; index++){
                localInput[index] = input[index];
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
        /* ... Remaining algo here. ... */
    }
But this version has its own problems: a single work-item performs the whole copy while the others wait at the barrier, and the conditional statement causes divergence, both of which decrease performance.
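One variant I am considering (only a sketch, not benchmarked) is to let every work-item copy a strided slice of the array, so that all work-items of a group take part in the copy and no single-item branch is needed. The helper name `copyToLocal` and the host-side shim under `#ifndef` are my own additions for illustration; the shim only exists so the same loop can be checked as plain C. The sketch assumes the host reserves room for N `float2` elements in `localInput` (for example by calling `clSetKernelArg` with a NULL pointer and a size of `N * sizeof(cl_float2)`).

```c
/* Sketch only. Builds as OpenCL C; the #ifndef block is a hypothetical
   host-side shim so the copy loop can also be compiled as plain C. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __local
#define barrier(flags) ((void)0)          /* no-op in the serial emulation */
typedef struct { float x, y; } float2;
static size_t sim_lid, sim_lsize;         /* emulated work-item state */
static size_t get_local_id(unsigned d)   { (void)d; return sim_lid; }
static size_t get_local_size(unsigned d) { (void)d; return sim_lsize; }
#endif

/* Work-item `lid` copies elements lid, lid + lsize, lid + 2*lsize, ...
   Together the work-items of one group cover all n elements, whatever
   the number of work-groups is. */
void copyToLocal(__local float2* dst, __global const float2* src, int n)
{
    for (size_t i = get_local_id(0); i < (size_t)n; i += get_local_size(0))
        dst[i] = src[i];
}

__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      const int N)
{
    copyToLocal(localInput, input, N);
    /* Every work-item must reach the barrier before localInput is read. */
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... Remaining algo here. ... */
}
```

OpenCL also provides the built-in `async_work_group_copy` together with `wait_group_events` for this kind of group-wide copy, which may be worth comparing against the explicit loop.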
What further modifications can be made to the code so that the entire array is copied to the shared memory of each work-group efficiently?
Any suggestions are much appreciated.