I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following:
- When we use cudaMalloc(), do we get a pointer to shared or global memory?
- Does global memory reside on the host or the device?
- Is there a size limit to either one?
- Which is faster to access?
Is storing a variable in shared memory the same as passing its address via the kernel? I.e., instead of having

__global__ void kernel() {
    __shared__ int i;
    foo(i);
}

why not equivalently do

__global__ void kernel(int *i_ptr) {
    foo(*i_ptr);
}

int main() {
    int *i_ptr;
    cudaMalloc((void **)&i_ptr, sizeof(int));
    kernel<<<blocks, threads>>>(i_ptr);
}
There have been many questions about specific speed issues with global vs. shared memory, but none offering an overview of when to use each in practice.
Many thanks