Flattened vs multi-dimensional arrays in shared memory for CUDA

Question

Is there any performance advantage to using flattened arrays over multi-dimensional arrays in CUDA shared memory?

I know that flattened arrays have certain advantages over multi-dimensional arrays in host memory, but I wasn't sure whether the same applies to arrays in GPU shared memory. I ask mainly because I've seen many code samples that use multi-dimensional arrays in shared memory, such as the matrix multiplication example in the CUDA Best Practices Guide.

Farzad Farzad · Accepted Answer · 2015-07-20T06:40:02

Multi-dimensional arrays are ultimately flattened to map onto the flat, linear shared memory address space. Therefore, neither form has a performance advantage over the other.

Using multi-dimensional shared arrays instead of flattened ones spares you the burden of manual index calculation. For multi-dimensional arrays, the compiler adds the index arithmetic to the generated code behind the scenes, which makes the source code easier to read.
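To make the flattening concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). It relies on the row-major layout rule that C++ applies to any built-in array, which is the same rule that governs a __shared__ array. The names ROWS, COLS, flat_index, offset_2d and layouts_match are hypothetical helpers introduced for illustration, with the dimensions chosen to match B[ 32 ][ 2 ].

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical dimensions, chosen to match B[32][2].
constexpr std::size_t ROWS = 32;
constexpr std::size_t COLS = 2;

// The manual index calculation you write yourself for a flattened,
// row-major array.
constexpr std::size_t flat_index(std::size_t row, std::size_t col) {
    return row * COLS + col;
}

// Offset (in elements) of B[row][col] from the start of B, measured from
// the addresses the compiler actually generated for the 2D access.
std::size_t offset_2d(const int (&B)[ROWS][COLS],
                      std::size_t row, std::size_t col) {
    const auto base = reinterpret_cast<std::uintptr_t>(&B[0][0]);
    const auto elem = reinterpret_cast<std::uintptr_t>(&B[row][col]);
    return static_cast<std::size_t>(elem - base) / sizeof(int);
}

// Returns true if every B[row][col] sits exactly where flat_index says.
bool layouts_match() {
    static int B[ROWS][COLS];
    for (std::size_t r = 0; r < ROWS; ++r) {
        for (std::size_t c = 0; c < COLS; ++c) {
            if (offset_2d(B, r, c) != flat_index(r, c)) {
                return false;
            }
        }
    }
    return true;
}
```

If layouts_match() returns true, the 2D access and the hand-written flat index address the same element, so the only difference between the two forms is who writes the arithmetic.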

One possible advantage of the flat representation over the multi-dimensional one is that it makes bank conflicts in shared memory accesses easier to reason about and spot. Consider this shared memory buffer:

__shared__ int A[ 64 ];

If you access the buffer like this:

int laneID = threadIdx.x & 31;
int ret    = A[ laneID * 2 ];

the stride of 2 is visible in the index, so the bank conflict between threads might be easier to spot than in the example below:

__shared__ int B[ 32 ][ 2 ];

where you access it like this:

int laneID = threadIdx.x & 31;
int ret    = B[ laneID ][ 0 ];

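Note that the two cases above are functionally equivalent: each thread reads the element at linear offset laneID * 2, so both have the same bank conflict.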
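The conflict in both cases can be checked with a small host-side model in plain C++. This is only a sketch, and it assumes 32 banks with successive 4-byte words mapped to successive banks; check the documentation for your GPU architecture before relying on that. WARP_SIZE, NUM_BANKS, bank_of and conflict_degree are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <array>

constexpr int WARP_SIZE = 32;
constexpr int NUM_BANKS = 32;  // Assumption: 32 banks, each 4 bytes wide.

// Bank holding the 4-byte word at word_index in a shared int array
// (under the bank model assumed above).
constexpr int bank_of(int word_index) {
    return word_index % NUM_BANKS;
}

// Worst-case number of lanes in one warp that hit the same bank when lane i
// reads word i * stride. Requires stride >= 1, so that every lane touches a
// different word and same-word broadcasts do not need to be modelled.
// A result of 1 means conflict-free; n means an n-way bank conflict.
int conflict_degree(int stride) {
    std::array<int, NUM_BANKS> hits{};  // zero-initialised hit counters
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        ++hits[bank_of(lane * stride)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Both A[ laneID * 2 ] and B[ laneID ][ 0 ] correspond to conflict_degree(2): under this model lanes k and k + 16 map to the same bank, giving a 2-way conflict, whereas a stride of 1 is conflict-free.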
