8
votes

I'm not seeing any speed improvement from shared memory on an NVIDIA Tesla M2050, which has about 49K of shared memory per block. In fact, if I allocate a large char array in shared memory, it slows my program down. For example,

__shared__ char database[49000];

gives me slower running times than

__shared__ char database[4900];

The program accesses only the first 100 chars of database so the extra space is unnecessary. I can't figure out why this is happening. Any help would be appreciated. Thanks.
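Here is a stripped-down sketch of what the kernel does (the identifiers and the body are simplified placeholders, not my actual code): the large array is declared, but only its first 100 bytes are ever touched.

// Simplified placeholder kernel: the shared array is large,
// but only its first 100 characters are read or written.
__global__ void search(const char *input, int *result, int n)
{
    __shared__ char database[49000];   // slower than declaring only 4900 bytes

    // Fill just the first 100 characters from global memory.
    for (int i = threadIdx.x; i < 100; i += blockDim.x)
        database[i] = input[i];
    __syncthreads();

    // Every thread reads only from that small prefix.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        result[gid] = database[gid % 100];
}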


3 Answers

33
votes

The reason your kernel performs relatively poorly when it declares the larger shared array may have to do with the fact that each multiprocessor has only a limited amount of shared memory available.

Each multiprocessor contains several processors; on modern devices there are typically 32, the same as the number of threads in a warp. This means that, in the absence of divergence or memory stalls, the average processing rate is 32 instructions per cycle (the latency is high because of pipelining).

CUDA schedules several blocks to a multiprocessor. Each block consists of several warps. When a warp stalls on a global memory access (even coalesced accesses have high latency), other warps are processed. This effectively hides the latency, which is why high-latency global memory is acceptable in GPUs. To effectively hide latency, you need enough extra warps to execute until the stalled warp can continue. If all warps stall on memory accesses, you can no longer hide the latency.

Shared memory is allocated to blocks in CUDA and stored on a single multiprocessor on the GPU device. Each multiprocessor has a relatively small, fixed amount of shared memory. CUDA cannot schedule more blocks onto a multiprocessor than the multiprocessor can support in terms of shared memory and register usage. In other words, if the amount of shared memory on a multiprocessor is X and each block requires Y shared memory, CUDA will schedule no more than floor(X/Y) blocks at a time on each multiprocessor (it might be fewer, since there are other constraints, such as register usage).
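As a rough back-of-the-envelope illustration, here is a sketch that assumes the roughly 48 KB (49152 bytes) of shared memory per multiprocessor on a Fermi-class M2050 and ignores the other constraints mentioned above:

#include <stdio.h>

int main(void)
{
    /* X: shared memory per multiprocessor (about 48 KB on a Fermi-class M2050). */
    const int smem_per_sm = 48 * 1024;

    /* Y: the two per-block shared memory requests from the question. */
    const int requests[] = { 49000, 4900 };

    for (int i = 0; i < 2; ++i) {
        /* floor(X / Y): an upper bound on resident blocks per multiprocessor;
         * register usage and the per-multiprocessor block limit can lower it further. */
        int blocks = smem_per_sm / requests[i];
        printf("Y = %5d bytes -> at most %2d block(s) per multiprocessor\n",
               requests[i], blocks);
    }
    return 0;
}

With the 49000-byte array, at most one block fits per multiprocessor; with the 4900-byte array, up to ten can be resident (before the other limits are applied), leaving many more warps available to hide memory latency.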

Ergo, by increasing the shared memory usage of a block, you might be reducing the number of active warps (the occupancy) of your kernel, thereby hurting performance. You can check your kernel's resource usage by compiling with the -Xptxas="-v" flag; this reports the register, shared memory, and constant memory usage of each kernel. Plug this data, along with your kernel launch parameters and the other required information, into the most recent version of the CUDA Occupancy Calculator to determine whether you are being limited by occupancy.
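If you are on a newer toolkit (CUDA 6.5 or later), the runtime API can also report this limit directly. A minimal sketch, assuming a placeholder kernel launched with 256 threads per block and no dynamic shared memory:

#include <cstdio>

__global__ void myKernel() { /* placeholder for the kernel under investigation */ }

int main()
{
    int maxBlocksPerSM = 0;
    // Reports how many blocks of myKernel can be resident on one multiprocessor,
    // given its register and static shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel,
                                                  256 /* threads per block */,
                                                  0   /* dynamic shared memory */);
    printf("At most %d block(s) of myKernel per multiprocessor\n", maxBlocksPerSM);
    return 0;
}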

EDIT:

To address the other part of your question, assuming no shared memory bank conflicts and perfect coalescing of global memory accesses, there are two dimensions to this answer: latency and bandwidth. The latency of shared memory will be lower than that of global memory, since shared memory is on-chip. The bandwidth will be much the same. Ergo, if you are able to hide global memory access latency through coalescing, there is no penalty. Note, though, that the access pattern matters here: shared memory allows for more diverse access patterns with little to no performance loss, so there can be benefits to using shared memory even if you can hide all of the global memory latency.
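As an illustration of that last point, here is a hypothetical staging pattern (the kernel name, table, perm, and out are made-up identifiers): the data is loaded from global memory once with a coalesced access, and then read back in an arbitrary, data-dependent order from shared memory, where the irregular pattern costs little.

// Hypothetical sketch: stage a small table into shared memory with a
// coalesced load, then access it in a permuted order at low latency.
__global__ void stagedLookup(const char *table, const int *perm, char *out, int n)
{
    __shared__ char cache[100];          // only as much as is actually used

    // Coalesced, cooperative load: consecutive threads read consecutive bytes.
    for (int i = threadIdx.x; i < 100; i += blockDim.x)
        cache[i] = table[i];
    __syncthreads();                     // make the staged data visible to all threads

    // Irregular reads now hit low-latency shared memory instead of global memory.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = cache[perm[gid] % 100];
}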

2
votes

Also, if you increase the shared memory per block, CUDA will schedule fewer concurrent blocks so that each one has enough shared memory, which reduces parallelism and increases execution time.

1
votes

The amount of resources available on the GPU is limited. The number of blocks running concurrently is roughly inversely proportional to the amount of shared memory per block.

This explains why the runtime is slower when you launch a kernel that uses a very large amount of shared memory.