40
votes

I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following:

  • When we use cudaMalloc(), do we get a pointer to shared or global memory?
  • Does global memory reside on the host or device?
  • Is there a size limit to either one?
  • Which is faster to access?
  • Is storing a variable in shared memory the same as passing its address via the kernel? I.e. instead of having

    __global__ void kernel() {
       __shared__ int i;
       foo(i);
    }
    

    why not equivalently do

    __global__ void kernel(int *i_ptr) {
       foo(*i_ptr);
    }
    
    int main() {
       int *i_ptr;
       cudaMalloc(&i_ptr, sizeof(int));
       kernel<<<blocks,threads>>>(i_ptr);
    }
    

There've been many questions about specific speed issues of global vs shared memory, but none offering an overview of when to use each one in practice.

Many thanks

3
@NolwennLeGuen The thing is, you have no control over what data resides in the L1/L2 cache, while in shared memory you know exactly what is there.

3 Answers

53
votes
  • When we use cudaMalloc()

    In order to store data on the GPU that can be communicated back to the host, we need memory that has been allocated and that lives until it is explicitly freed. Think of global memory as heap space, with a lifetime until the application closes or the memory is freed; it is visible to any thread and block that has a pointer to that memory region. Shared memory can be considered as stack space, with a lifetime until a block of a kernel finishes; its visibility is limited to threads within the same block. So cudaMalloc is used to allocate space in global memory.

  • Do we get a pointer to shared or global memory?

    You will get a pointer to a memory address residing in the global memory.

  • Does global memory reside on the host or device?

    Global memory resides on the device. However, there are ways to use host memory as "global" memory using mapped memory, see: CUDA Zero Copy memory considerations. However, access may be slow due to the transfer speed limitations of the bus.

  • Is there a size limit to either one?

    The size of global memory varies from card to card, anything up to 32 GB (V100), while the amount of shared memory depends on the compute capability. Devices below compute capability 2.x have a maximum of 16 KB of shared memory per multiprocessor (where the number of multiprocessors varies from card to card), and cards with compute capability 2.x and greater have a minimum of 48 KB of shared memory per multiprocessor.

    See https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

    If you are using mapped memory, the only limitation is how much memory the host machine has.

  • Which is faster to access?

    In terms of raw numbers, shared memory is much faster (shared memory ~1.7 TB/s, while global memory bandwidth is on the order of hundreds of GB/s). However, in order to do anything you need to fill the shared memory with something, and you usually pull from global memory to do so. If the accesses to global memory are coalesced (non-random) and use large word sizes, you can achieve speeds close to the theoretical limit of hundreds of GB/s, depending on the card and its memory interface.

    Shared memory is used when, within a block of threads, you need to reuse data already pulled or computed from global memory. Instead of pulling from global memory again, you put it in shared memory for the other threads within the same block to see and reuse (see the sketch after this list).

    It is also commonly used as a scratch pad to reduce register pressure, which affects how many work groups (blocks) can be resident at the same time.

  • Is storing a variable in shared memory the same as passing its address via the kernel?

    No. If you pass an address of anything, it is always an address in global memory. From the host you cannot set shared memory: you either pass the value as a kernel argument, which the kernel then writes into shared memory itself, or you pass an address in global memory from which the kernel pulls the data when needed (the second sketch below illustrates both).
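
To make the reuse pattern concrete, here is a minimal sketch (not from the original answer) of a 1D moving-average kernel: the host allocates global memory with cudaMalloc, and each block stages a tile of that data in shared memory so that neighbouring threads can reread it without going back to global memory. The kernel name, sizes, and 3-point window are illustrative.

#include <cuda_runtime.h>
#include <cstdio>

#define TILE 256   // threads (and output elements) per block

// Each block loads TILE + 2 input elements into shared memory once,
// then every thread reads three of them, so each global value is
// reused by up to three threads instead of being fetched three times.
__global__ void movingAverage(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                        // index inside the tile (with halo)

    // Stage this block's slice of global memory (plus a 1-element halo on each side).
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();  // make the tile visible to every thread in the block

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;                              // pointers into global (device) memory
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    movingAverage<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[1] = %f\n", h_out[1]);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}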
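
And as a minimal sketch of the last point (names are illustrative, not from the answer): shared memory can only be filled from inside the kernel, either from a by-value kernel argument or from global memory.

__global__ void fillShared(int value, const int *g_src)
{
    __shared__ int s;            // lives in shared memory, one copy per block

    if (threadIdx.x == 0)
        s = value + *g_src;      // set from a kernel argument and from global memory
    __syncthreads();             // make the write visible to the whole block

    // every thread of this block can now read s; no host code can touch it
}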

10
votes

The contents of global memory are visible to all the threads of the grid. Any thread can read and write to any location of the global memory.

Shared memory is separate for each block of the grid. Any thread of a block can read and write to the shared memory of that block. A thread in one block cannot access shared memory of another block.

  1. cudaMalloc always allocates global memory.
  2. Global memory resides on the device.
  3. Obviously, every memory has a size limit. The global memory is the total amount of DRAM of the GPU you are using; e.g. I use a GTX460M which has 1536 MB DRAM, therefore 1536 MB of global memory. Shared memory is specified by the device architecture and is measured on a per-block basis. Devices of compute capability 1.0 to 1.3 have 16 KB/block; compute capability 2.0 onwards has 48 KB/block of shared memory by default (see the query sketch after this list).
  4. Shared memory is orders of magnitude faster to access than global memory. It's like a local cache shared among the threads of a block.
  5. No. Only global memory addresses can be passed to a kernel launched from the host. In your first example, the variable is read from shared memory, while in the second one, it is read from global memory.
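
As a rough sketch (not part of the original answer), both limits can be queried at runtime with cudaGetDeviceProperties; the field names below are from the CUDA runtime API.

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    printf("compute capability      : %d.%d\n", prop.major, prop.minor);
    printf("global memory           : %zu MB\n", prop.totalGlobalMem >> 20);
    printf("shared memory per block : %zu KB\n", prop.sharedMemPerBlock >> 10);
    return 0;
}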

Update:

Devices of Compute Capability 7.0 (Volta architecture) allow allocating shared memory of up to 96 KB per block, provided the following conditions are satisfied.

  • Shared memory is allocated dynamically
  • Before launching the kernel, the maximum size of dynamic shared memory is specified using the function cudaFuncSetAttribute as follows.

__global__ void MyKernel(...)
{
    extern __shared__ float shMem[];   // dynamically allocated shared memory
}

int bytes = 98304; // 96 KB
// Opt in to the larger limit before the kernel is launched.
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);

MyKernel<<<gridSize, blockSize, bytes>>>(...); // third launch parameter = dynamic shared memory in bytes
3
votes

CUDA shared memory is memory shared between the threads within a block; across the blocks of a grid the contents of shared memory are undefined. It can be thought of as a manually managed on-chip cache.

Usually global memory resides on the device, but recent versions of CUDA (if the device supports it) can map host memory into the device address space, triggering in-situ DMA transfers from host to device memory on such accesses (a short sketch follows).
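
A minimal sketch of that mapped (zero-copy) path, assuming a device that supports host-memory mapping; the kernel and variable names are illustrative:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // reads/writes travel over the bus to host memory
}

int main()
{
    const int n = 1024;

    cudaSetDeviceFlags(cudaDeviceMapHost);        // must be set before the context is created

    float *h_data;                                // pinned host memory, mapped into the device
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_alias;                               // device-side alias of the same buffer
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);

    scale<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();                      // the kernel wrote directly into host memory

    printf("h_data[0] = %f\n", h_data[0]);        // 2.0, no cudaMemcpy needed
    cudaFreeHost(h_data);
    return 0;
}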

There's a size limit on shared memory, depending on the device. It's reported in the device capabilities, retrieved when enumerating CUDA devices. Global memory is limited by the total memory available to the GPU. For example, a GTX 680 offers 48 KiB of shared memory per multiprocessor and 2 GiB of device memory.

Shared memory is faster to access than global memory, but access patterns must be aligned carefully (for both shared and global memory) to be efficient. If you can't make your access patterns properly aligned, use textures (also global memory, but accessed through different circuitry and a separate cache that can deal better with unaligned access).
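
As a rough illustration of that texture path (a sketch using the modern texture-object API; the buffer and kernel names are made up), irregular reads are served through the texture cache instead of the ordinary load path:

#include <cuda_runtime.h>

__global__ void gather(cudaTextureObject_t tex, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, idx[i]);  // irregular reads go through the texture cache
}

void launchGather(const float *d_data, const int *d_idx, float *d_out, int n)
{
    // Describe an existing global-memory buffer as a linear texture resource.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = (void *)d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    gather<<<(n + 255) / 256, 256>>>(tex, d_idx, d_out, n);
    cudaDestroyTextureObject(tex);
}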

Is storing a variable in shared memory the same as passing its address via the kernel?

No, definitely not. The code you proposed would be a case where you'd use global memory transferred in situ. Shared memory cannot be passed between kernels, as the contents of a shared block are defined only within an execution block of threads.