Here a draft of code I produced :
void __kernel myKernel(__global const short* input,
__global short* output,
const int width,
const int height){
// Always square. (and 16x16 in our example)
const uint local_size = get_local_size(0);
// Get the work-item col/row index
const uint wi_c = get_local_id(0);
const uint wi_r = get_local_id(1);
// Get the global col/row index
const uint g_c = get_global_id(0);
const uint g_r = get_global_id(1);
// Declare a local array NxN
const uint arr_size = local_size *local_size ;
__local short local_in[arr_size];
// Transfer the global memory for into a local one.
local_in[wi_c + wi_r*local_size ] = input[g_c + g_r*width];
// Wait that all the work-item are sync
barrier(CLK_LOCAL_MEM_FENCE);
// Now add code to process on the local array (local_in).
As far as I understand OpenCL work-group/work-item, this is what I need to do to copy a global 16x16 ROI of from global to local memory. (Please correct me if I'm wrong, since I'm beginning at this).
So after the barrier, each element in local_in can be access via wi_c + wi_r*local_size
.
But now let's do something tricky. If I want for each work-item in my work group to work on a 3x3 neighborhood, I will need a 18x18 local_in array.
But how to create this ? Since I have only 16x16=256 work-item (threads), but I need 18x18=324 (missing 68 threads to do it).
My basic idea should be to do:
if(wi_c == 0 && wi_r == 0){
// Code that copy the border into the new array that should be
// local_in[(local_size+2)*(local_size+2)];
}
But this is terrible, since the first work-item (1st thread) will have to handle all the border and the rest of the work-items in this group will just be waiting this 1st work-item to finish. (Again, this is my understanding of OpenCL, might be wrong).
So here are my real question:
- Is there another easier solution for this kind of problem ? Like changing the NDRange Local size to be overlapping or something ?
- I start to read about coalesced memory access, is my first draft of code look like it ? I don't think so, since I'm using a "stride" approach to load the global memory. But I don't understand how I could change the first part of that code to be efficient also.
- Once the barrier is reached, the processing continue of each work-item to get a final value that need to be stored back into the global output array. Should I put again a barrier before this "write" or all good to leave all the work-item finish their self ?