
I have a program running under OpenCL where, after I perform the calculations in private memory, I would like to write the results to global memory. I have no further use for the results in the kernel. Essentially, I am looking for a built-in way to write to global memory from either __local or __private memory asynchronously.

I already tried async_work_group_copy, and I noticed that to ensure the data is copied correctly I have to wait for the event. On my card, an AMD HD7970, this performs the same as doing a synchronous copy directly to global memory.

Does anyone have experience using async_work_group_copy without waiting for the event, or know of any other viable alternative?

for (...) {
    // Calculate some results and copy them to the __local array src
    event_t e = async_work_group_copy(dest, src, size, 0);
    wait_group_events(1, &e);  // Can we safely skip this?
}

Here src is __local and dest is __global.

I suspect that, since this function has to be called identically by every work-item in the work-group, skipping the wait may not work because other work-items may not have finished. The copy is also inside a for loop, which complicates things further.

So your kernel has to do more work after the copy, but not with that global data? - mfa
Yes. The kernel essentially manufactures data and puts it in global memory. There is some input data, but that is read once at the beginning. The manufactured data does not need to be consumed by the calculation. Any data I need I keep in private memory at all times. - panos

1 Answer


I don't think there is much you need to (or can) do in this situation. I know that Intel's GPU implementation will not stall on a global write unless there is a register dependency hazard too soon after the write (e.g. if the program reuses that register shortly after the write, it stalls until the hazard clears). Unfortunately, you can't really control register allocation, or even inspect it.
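
If the wait cannot be dropped, one option is to defer it. The sketch below issues the copy, computes the next result in private memory, and only then waits, just before the __local staging buffer is overwritten. Every name in it (produce, compute_value, CHUNK, n_chunks, staging) is hypothetical and not taken from the question. It assumes a 1-D work-group of exactly CHUNK work-items and a dest buffer large enough for every group's output. Whether the copy actually overlaps with the computation depends on the implementation; on hardware where async_work_group_copy behaves like a synchronous copy, as reported in the question for the HD7970, this may gain nothing.

```c
// Sketch only: all names are illustrative, not from the original post.
#define CHUNK 256  // assumed work-group size and elements written per iteration

// Placeholder for the real per-work-item computation (private memory only).
float compute_value(int chunk, size_t lid)
{
    return (float)(chunk * CHUNK) + (float)lid;
}

__kernel __attribute__((reqd_work_group_size(CHUNK, 1, 1)))
void produce(__global float *dest, int n_chunks)
{
    __local float staging[CHUNK];
    event_t e;
    const size_t lid = get_local_id(0);

    // Each work-group writes its own contiguous region of dest.
    __global float *out = dest + get_group_id(0) * (size_t)n_chunks * CHUNK;

    for (int i = 0; i < n_chunks; ++i) {
        // Private-memory work; may overlap with the copy issued last iteration.
        float r = compute_value(i, lid);

        // The previous copy still reads staging, so wait before overwriting it.
        // The condition is uniform across the group, as the built-in requires.
        if (i > 0)
            wait_group_events(1, &e);

        staging[lid] = r;
        barrier(CLK_LOCAL_MEM_FENCE);  // every work-item has stored its value

        // Issued by all work-items in the group with identical arguments.
        e = async_work_group_copy(out + (size_t)i * CHUNK, staging, CHUNK, 0);
    }

    // Wait for the last copy before the kernel returns.
    if (n_chunks > 0)
        wait_group_events(1, &e);
}
```

This keeps one wait_group_events call per copy, so it does not rely on a copy finishing without a wait. It only moves the wait to the last point where the staging buffer is still untouched.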