I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following:
- When we use cudaMalloc(), do we get a pointer to shared or global memory?
- Does global memory reside on the host or the device?
- Is there a size limit to either one?
- Which is faster to access?
Is storing a variable in shared memory the same as passing its address via the kernel? I.e., instead of having

__global__ void kernel() {
    __shared__ int i;
    foo(i);
}

why not equivalently do

__global__ void kernel(int *i_ptr) {
    foo(*i_ptr);
}

int main() {
    int *i_ptr;
    cudaMalloc((void **)&i_ptr, sizeof(int));
    kernel<<<blocks, threads>>>(i_ptr);
}
There have been many questions about specific speed issues with global vs. shared memory, but none offering an overview of when to use each in practice.
Many thanks