I have an image processing kernel that uses a buffer of flags that is too large to fit into local memory. The flags are accessed in a predictable raster pattern (upper left to lower right).
My idea is to store the flags in global memory and use local memory as a cache for them. So, as I progress along the raster pattern, I want to read flags from global into local memory, do some processing, then write the flags back to global. But I want to hide the latency involved.
So, suppose I access my image as a series of locations: a1, a2, a3, …
I want to do the following:
- fetch a1 flags
- fetch a2 flags; while a2 flags are being fetched, process the a1 location and store results back to global memory
- fetch a3 flags; while a3 flags are being fetched, process the a2 location and store results back to global memory
- etc.
How should I structure my code to ensure that the latency is hidden? Do I need to use vload/vstore to do this, or will the GPU hardware do the latency hiding automatically?
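For context, here is a double-buffered sketch of the pipeline described above, using OpenCL's built-in async_work_group_copy / wait_group_events to overlap the next fetch with processing of the current tile. The tile size, the flag type (uchar), and the placeholder per-item processing are all assumptions, not from the question; it assumes the work-group size equals TILE.

```c
#define TILE 256  /* assumed: flags per raster location == work-group size */

__kernel void raster(__global uchar *gflags, uint ntiles)
{
    __local uchar buf[2][TILE];   /* two tiles: one in flight, one in use */
    event_t ev[2];
    size_t lid = get_local_id(0);

    /* Prime the pipeline: start fetching tile 0. */
    ev[0] = async_work_group_copy(buf[0], gflags, TILE, 0);

    for (uint i = 0; i < ntiles; ++i) {
        uint cur = i & 1, nxt = cur ^ 1;

        /* Kick off the next fetch BEFORE waiting on the current one,
           so the copy overlaps the processing below. */
        if (i + 1 < ntiles)
            ev[nxt] = async_work_group_copy(buf[nxt],
                                            gflags + (i + 1) * (size_t)TILE,
                                            TILE, 0);

        /* Wait only for the tile we are about to process. */
        wait_group_events(1, &ev[cur]);

        buf[cur][lid] ^= 1;            /* placeholder per-item processing */
        barrier(CLK_LOCAL_MEM_FENCE);  /* all items done writing buf[cur] */

        /* Write the processed flags back to global memory. */
        ev[cur] = async_work_group_copy(gflags + i * (size_t)TILE,
                                        buf[cur], TILE, 0);
        wait_group_events(1, &ev[cur]);
    }
}
```

Note that async_work_group_copy must be reached by all work-items in the group with the same arguments, and the barrier before the write-back is what makes the overlap safe; whether the copy actually runs asynchronously (e.g. via a DMA engine) is implementation-dependent.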
barrier? - Emanuele Giona

barriers are needed to ensure all work items reach a certain point in the kernel before allowing them to proceed, but this doesn't affect the latency of memory transactions. - Jacko