I have an OpenCL program that performs its calculations in private memory and then needs to write the results to global memory. I have no use for the results later on, so essentially I am looking for a built-in way to write to global memory from either __local or __private memory asynchronously.
I already tried async_work_group_copy, and I noticed that to ensure the data is copied correctly I have to wait for the event. On my card, an AMD HD7970, this is the same as doing a synchronous copy directly to global memory.
Does anyone have experience with using async_work_group_copy without waiting for the event, or with any other viable alternative?
for (...) {
    // Calculate some results and copy them to the __local array src
    event_t e = async_work_group_copy(dest, src, size, 0);
    wait_group_events(1, &e); // Can we safely skip this?
}
Here src is __local and dest is __global.
I suspect that, since this function has to be called identically by every work-item in the work-group, skipping the wait may not work because other work-items may not have finished. The copy also sits inside a for loop where src is reused on every iteration, which complicates things further.
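The only variant I can think of that looks safe is to defer the wait rather than skip it: alternate between two __local buffers and wait for a buffer's previous copy only just before overwriting that buffer. Below is a minimal, untested sketch of that idea. The kernel name write_results, the TILE size, the iterations argument, the placeholder computation and the dest layout are all assumptions made up for illustration, and I do not know whether any implementation actually overlaps the copy with the computation.

```opencl
// Hypothetical sketch: write_results, TILE and iterations are illustrative names,
// not taken from my real program.
#define TILE 256  // assumed number of floats each work-group produces per iteration

__kernel void write_results(__global float *dest, const int iterations)
{
    __local float src[2][TILE];  // two buffers: fill one while the other may still be copying
    event_t pending[2];          // event of the last copy issued from each buffer

    for (int i = 0; i < iterations; ++i) {
        const int buf = i & 1;

        // src[buf] was last copied two iterations ago; wait for that copy
        // before overwriting it. The condition is uniform across the
        // work-group, so every work-item reaches the wait together.
        if (i >= 2)
            wait_group_events(1, &pending[buf]);

        // Placeholder for the real calculation: each work-item fills its share.
        for (size_t k = get_local_id(0); k < TILE; k += get_local_size(0))
            src[buf][k] = (float)(i * TILE + (int)k);

        // Make every work-item's writes to src[buf] visible before the copy starts.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Assumed layout: each work-group owns a contiguous run of iterations * TILE floats.
        const size_t offset = ((size_t)get_group_id(0) * iterations + i) * TILE;
        pending[buf] = async_work_group_copy(dest + offset, src[buf], TILE, 0);
    }

    // Drain the copies that are still outstanding before the kernel returns.
    const int outstanding = iterations < 2 ? iterations : 2;
    for (int j = 0; j < outstanding; ++j)
        wait_group_events(1, &pending[j]);
}
```

This still waits for every event, just one or two iterations later, so it costs twice the __local memory and does not answer whether the wait can be dropped entirely.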