Suppose that we have an array int *data, and each thread will access one element of this array. Since this array will be shared among all threads, it will be stored in global memory.
Let's create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array will be in global memory, because I allocated memory for this array using cudaMalloc. As for the other variables, I've seen examples that pass an integer directly to the kernel function without allocating memory for it. In my case those variables are a, b and c.
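For context, this is roughly how I allocate and launch everything on the host side (the sizes, values and launch configuration below are just placeholders):
int *data;
cudaMalloc((void **)&data, 10 * 1024 * sizeof(int));   // data lives in global memory
// ... fill data with cudaMemcpy ...

int a = 1, b = 2, c = 3;                                // plain host integers
test<<<10, 1024>>>(data, a, b, c);                      // scalars passed by value to the kernel
cudaDeviceSynchronize();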
If I'm not mistaken, even though we do not call cudaMalloc directly to allocate 4 bytes for each of the three integers, CUDA will automatically do it for us, so in the end the variables a, b and c will be allocated in global memory.
These variables are only auxiliary: the threads only read them and nothing else. My question is, wouldn't it be better to transfer these variables to shared memory?
I imagine that if we had, for example, 10 blocks with 1024 threads each, we would need 10*3 = 30 reads of 4 bytes to store the numbers in the shared memory of each block. Without shared memory, if each thread has to read all three variables once, the total number of global memory reads would be 1024*10*3 = 30720, which is very inefficient.
Now here is the problem: I'm somewhat new to CUDA, and I'm not sure whether it's possible to transfer the variables a, b and c to the shared memory of each block without having every thread read them from global memory and load them into shared memory itself, because in that case the total number of global memory reads would be 1024*10*3 = 30720 and not 10*3 = 30.
On the following website there is this example:
__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];   // one statically sized array per block
  int t = threadIdx.x;
  int tr = n - t - 1;     // mirrored index for the reversal
  s[t] = d[t];            // each thread copies one element from global to shared memory
  __syncthreads();        // wait until the whole block has filled s
  d[t] = s[tr];           // write the reversed element back to global memory
}
Here each thread loads different data into the shared array s: according to its index, each thread loads one specific element into shared memory.
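If I understand it correctly, that kernel is meant to be launched with a single block of 64 threads, something like this (the pointer name d_d and the surrounding host code are just my guess):
int *d_d;
cudaMalloc((void **)&d_d, 64 * sizeof(int));
// ... copy 64 integers into d_d with cudaMemcpy ...
staticReverse<<<1, 64>>>(d_d, 64);   // one block of 64 threads, n matches the shared array size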
In my case, I want to load only the variables a, b and c into shared memory. These variables are always the same, they don't change, so they have nothing to do with the threads themselves; they are auxiliary and are used by every thread to run some algorithm.
How should I approach this problem? Is it possible to achieve this by doing only total_amount_of_blocks*3 global memory reads?
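To make it concrete, this is the kind of thing I have in mind, assuming (hypothetically) that the three values are placed in a small device array int *abc allocated with cudaMalloc instead of being passed by value:
__global__ void test(int *data, int *abc)
{
    __shared__ int s_a, s_b, s_c;      // one copy of a, b and c per block

    if (threadIdx.x == 0) {            // only one thread per block touches global memory
        s_a = abc[0];                  // 3 global reads per block -> total_amount_of_blocks*3 in total
        s_b = abc[1];
        s_c = abc[2];
    }
    __syncthreads();                   // make the values visible to every thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // ... every thread now reads s_a, s_b and s_c from shared memory and works on data[i] ...
}
Would something like this actually bring the number of global memory reads down to total_amount_of_blocks*3, or is there a better way?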