3 votes

I need some help with CUDA global memory. In my project I need to declare a global array so that I can avoid sending the array on every kernel call.

EDIT:

My application calls the kernel more than 1,000 times, and on every call I send it an array larger than 1000 x 1000. I think this transfer is what makes my app slow, so I would like to declare a global array on the GPU. My questions are:

1. How do I declare a global array?

2. How do I initialize the global array from the CPU before the kernel call?

Thanks in advance

2
Shared memory only has block scope. You cannot allocate and populate shared memory before a kernel runs. - talonmies
Maybe I need global memory? Can I declare a global array for every block and thread? - Hayk Nahapetyan
Yes, you can, but it doesn't solve any problem. Keep your data in global memory and, when you call the kernel, copy it to shared memory inside the kernel. Doing so doesn't help if you use that data only once, or if each thread needs only a single element; in that case you are better off using a normal variable and assigning the data to it. That way the compiler will automatically place the variable in a register. - Soroosh Bateni
Thank you very much for the response. Can you please take a look at my edits? - Hayk Nahapetyan
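The staging pattern described in the comments (keep the data in global memory, copy it into shared memory inside the kernel) can be sketched roughly as follows; the kernel name, tile size, and the doubling operation are illustrative, not from the question:

```cuda
#define TILE 256

__global__ void process(const int *g_data, int *g_out, int n)
{
    __shared__ int tile[TILE];                 // block-scoped shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = g_data[i];         // global -> shared copy
    __syncthreads();                           // wait until all loads finish

    if (i < n)
        g_out[i] = tile[threadIdx.x] * 2;      // work on the shared copy
}
```

This only pays off when threads in a block reuse each other's elements; for a single read per thread, a plain local variable (a register) is enough, as the comment notes.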

2 Answers

5 votes

Your edited question is confusing: you say you are sending your kernel an array of size 1000 x 1000, but you ask how to do this using a global array. The only way I know of to pass that much data to a kernel is through an array in global memory, so you are probably already doing this.

Nevertheless, there are at least two methods to create and initialize an array in global memory:

1. Statically, using __device__ and cudaMemcpyToSymbol, for example:

 #define SIZE 100
 __device__ int A[SIZE];
 ...
 int main(){
   int myA[SIZE];
   for (int i=0; i< SIZE; i++) myA[i] = 5;
   cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
   ...
   (kernel calls, etc.)
 }

(device variable reference, cudaMemcpyToSymbol reference)
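With the static approach, a kernel can reference A directly by name at file scope, without receiving it as a parameter. A minimal sketch (the kernel name and launch configuration are mine, not from the answer):

```cuda
__global__ void addOne()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < SIZE)
        A[i] += 1;   // __device__ array A is visible to all kernels in this file
}

// launched e.g. as:
// addOne<<<(SIZE + 255) / 256, 256>>>();
```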

2. Dynamically, using cudaMalloc and cudaMemcpy:

 #define SIZE 100
 ...
 int main(){
   int myA[SIZE];
   int *A;
   for (int i=0; i< SIZE; i++) myA[i] = 5;
   cudaMalloc((void **)&A, SIZE*sizeof(int));
   cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
   ...
   (kernel calls, etc.)
 }

(cudaMalloc reference, cudaMemcpy reference)
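With the dynamic approach, the device pointer is passed to the kernel as an ordinary parameter, and results can be copied back with cudaMemcpy in the opposite direction. A minimal sketch (again, the kernel name is illustrative):

```cuda
__global__ void addOne(int *d_A, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_A[i] += 1;   // d_A points into global memory allocated with cudaMalloc
}

// usage, continuing the example above:
// addOne<<<(SIZE + 255) / 256, 256>>>(A, SIZE);
// cudaMemcpy(myA, A, SIZE * sizeof(int), cudaMemcpyDeviceToHost);
```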

For clarity I'm omitting error checking, which you should do on all CUDA API calls and kernel launches.
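One common way to add that error checking is a small wrapper macro around every CUDA API call; this is a sketch of the usual pattern, not part of the original answer:

```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

// e.g. CUDA_CHECK(cudaMalloc((void **)&A, SIZE * sizeof(int)));
// after a kernel launch: CUDA_CHECK(cudaGetLastError());
```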

0 votes

If I understand this question correctly (it is somewhat unclear), you want to use a global array and send it to the device on every kernel call. That is bad practice and leads to high latency, because on every kernel call you have to transfer your data to the device. In my experience this pattern led to negative speed-up.

An optimal approach is to use what I call the flip-flop technique. It works like this:

  1. Declare two arrays on the device, d_arr1 and d_arr2.
  2. Copy the data host -> device into one of the arrays.
  3. Pass pointers to d_arr1 and d_arr2 as kernel parameters.
  4. Process the data inside the kernel.
  5. On subsequent kernel calls, swap the pointers you pass as parameters.

This way you avoid transferring the data on every kernel call. You transfer only at the beginning and at the end of your host loop.

int a;
for (a = 0; a < 1000; a++)
{
  if (a % 2 == 0)
    //call to the kernel(pointer_a, pointer_b)
  else
    //call to the kernel(pointer_b, pointer_a)
}
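The five steps above can be fleshed out into a complete sketch. The kernel name (step_kernel), the size N, and the launch configuration are placeholders of mine; the structure (copy in once, swap pointers between launches, copy out once) is the flip-flop technique described above:

```cuda
int *d_arr1, *d_arr2;
cudaMalloc((void **)&d_arr1, N * sizeof(int));
cudaMalloc((void **)&d_arr2, N * sizeof(int));
cudaMemcpy(d_arr1, h_data, N * sizeof(int), cudaMemcpyHostToDevice);  // one transfer in

int *in = d_arr1, *out = d_arr2;
for (int a = 0; a < 1000; a++) {
    step_kernel<<<grid, block>>>(in, out);  // reads 'in', writes 'out'
    int *tmp = in; in = out; out = tmp;     // swap roles for the next call
}

// after the loop, 'in' points at the most recent result
cudaMemcpy(h_data, in, N * sizeof(int), cudaMemcpyDeviceToHost);      // one transfer out
```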