I have an image processing kernel that uses a buffer of flags that is too large to fit into local memory. The flags are accessed in a predictable raster pattern (upper left to lower right).
My idea is to store the flags in global memory and use local memory as a cache for them. So, as I progress along the raster pattern, I want to read flags from global into local memory, do some processing, then write the flags back to global. But I want to hide the latency involved.
So, suppose I access my image as a series of locations: a1, a2, a3, …
I want to do the following:
- fetch a1 flags
- fetch a2 flags; while a2 flags are being fetched, process the a1 location and store results back to global memory
- fetch a3 flags; while a3 flags are being fetched, process the a2 location and store results back to global memory
- etc.
How should I structure my code to ensure that the latency is hidden? Do I need to use vload/vstore to do this, or will the GPU hardware do the latency hiding automatically?
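For context, here is a double-buffered sketch of the pipeline described above, using OpenCL's built-in async_work_group_copy / wait_group_events to overlap the next fetch with processing of the current tile. The tile size, the flag type (uchar), and the placeholder per-item processing are all assumptions, not from the question; it assumes the work-group size equals TILE.

```c
#define TILE 256  /* assumed: flags per raster location == work-group size */

__kernel void raster(__global uchar *gflags, uint ntiles)
{
    __local uchar buf[2][TILE];   /* two tiles: one in flight, one in use */
    event_t ev[2];
    size_t lid = get_local_id(0);

    /* Prime the pipeline: start fetching tile 0. */
    ev[0] = async_work_group_copy(buf[0], gflags, TILE, 0);

    for (uint i = 0; i < ntiles; ++i) {
        uint cur = i & 1, nxt = cur ^ 1;

        /* Kick off the next fetch BEFORE waiting on the current one,
           so the copy overlaps the processing below. */
        if (i + 1 < ntiles)
            ev[nxt] = async_work_group_copy(buf[nxt],
                                            gflags + (i + 1) * (size_t)TILE,
                                            TILE, 0);

        /* Wait only for the tile we are about to process. */
        wait_group_events(1, &ev[cur]);

        buf[cur][lid] ^= 1;            /* placeholder per-item processing */
        barrier(CLK_LOCAL_MEM_FENCE);  /* all items done writing buf[cur] */

        /* Write the processed flags back to global memory. */
        ev[cur] = async_work_group_copy(gflags + i * (size_t)TILE,
                                        buf[cur], TILE, 0);
        wait_group_events(1, &ev[cur]);
    }
}
```

Note that async_work_group_copy must be reached by all work-items in the group with the same arguments, and the barrier before the write-back is what makes the overlap safe; whether the copy actually runs asynchronously (e.g. via a DMA engine) is implementation-dependent.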
barrier? - Emanuele Giona

barriers are needed to ensure all work items reach a certain point in the kernel before allowing them to proceed, but this doesn't affect the latency of memory transactions. - Jacko