3 votes

I need some help with CUDA global memory. In my project I need to declare a global array so that I can avoid sending the array on every kernel call.

EDIT:

My application calls the kernel more than 1,000 times, and on every call I send it an array larger than 1000 x 1000. I think this transfer is what makes my app slow, so I would like to declare a global array on the GPU. My questions are:

1. How do I declare a global array?

2. How do I initialize the global array from the CPU before the kernel call?

Thanks in advance

2
Shared memory only has block scope. You cannot allocate and populate shared memory before a kernel runs. - talonmies
Maybe I need global memory? Can I declare a global array for every block and thread? - Hayk Nahapetyan
Yes, you can, but it doesn't solve any problem. Keep your data in global memory and, when you call the kernel, copy it to shared memory inside the kernel. Doing so doesn't help if you use that data only once, or if each thread needs only a single element; in that case you are better off using a normal variable and assigning the data to it. That way the compiler will automatically place the variable in a register. - Soroosh Bateni
Thank you very much for the response. Can you please take a look at my edits? - Hayk Nahapetyan
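The staging pattern described in the comments (keep the data in global memory, copy it into shared memory inside the kernel) can be sketched roughly as follows; the kernel name, tile size, and the doubling operation are illustrative, not from the question:

```cuda
#define TILE 256

__global__ void process(const int *g_data, int *g_out, int n)
{
    __shared__ int tile[TILE];                 // block-scoped shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = g_data[i];         // global -> shared copy
    __syncthreads();                           // wait until all loads finish

    if (i < n)
        g_out[i] = tile[threadIdx.x] * 2;      // work on the shared copy
}
```

This only pays off when threads in a block reuse each other's elements; for a single read per thread, a plain local variable (a register) is enough, as the comment notes.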

2 Answers

5 votes

Your edited question is confusing: you say you are sending your kernel an array of size 1000 x 1000, but you ask how to do this using a global array. The only way I know of to pass that much data to a kernel is through an array in global memory, so you are probably already doing this.

Nevertheless, there are at least two methods to create and initialize an array in global memory:

1. Statically, using __device__ and cudaMemcpyToSymbol, for example:

 #define SIZE 100
 __device__ int A[SIZE];
 ...
 int main(){
   int myA[SIZE];
   for (int i=0; i< SIZE; i++) myA[i] = 5;
   cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
   ...
   (kernel calls, etc.)
 }

(device variable reference, cudaMemcpyToSymbol reference)
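With the static approach, a kernel can reference A directly by name at file scope, without receiving it as a parameter. A minimal sketch (the kernel name and launch configuration are mine, not from the answer):

```cuda
__global__ void addOne()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < SIZE)
        A[i] += 1;   // __device__ array A is visible to all kernels in this file
}

// launched e.g. as:
// addOne<<<(SIZE + 255) / 256, 256>>>();
```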

2. Dynamically, using cudaMalloc and cudaMemcpy:

 #define SIZE 100
 ...
 int main(){
   int myA[SIZE];
   int *A;
   for (int i=0; i< SIZE; i++) myA[i] = 5;
   cudaMalloc((void **)&A, SIZE*sizeof(int));
   cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
   ...
   (kernel calls, etc.)
 }

(cudaMalloc reference, cudaMemcpy reference)
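With the dynamic approach, the device pointer is passed to the kernel as an ordinary parameter, and results can be copied back with cudaMemcpy in the opposite direction. A minimal sketch (again, the kernel name is illustrative):

```cuda
__global__ void addOne(int *d_A, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_A[i] += 1;   // d_A points into global memory allocated with cudaMalloc
}

// usage, continuing the example above:
// addOne<<<(SIZE + 255) / 256, 256>>>(A, SIZE);
// cudaMemcpy(myA, A, SIZE * sizeof(int), cudaMemcpyDeviceToHost);
```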

For clarity I'm omitting error checking, which you should do on all CUDA API calls and kernel launches.
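One common way to add that error checking is a small wrapper macro around every CUDA API call; this is a sketch of the usual pattern, not part of the original answer:

```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

// e.g. CUDA_CHECK(cudaMalloc((void **)&A, SIZE * sizeof(int)));
// after a kernel launch: CUDA_CHECK(cudaGetLastError());
```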

0 votes

If I understand this question correctly (it is somewhat unclear), you want to use a global array and send it to the device on every kernel call. That is bad practice and leads to high latency, because on every kernel call you have to transfer your data to the device. In my experience this pattern led to negative speed-up.

An optimal approach is to use what I call the flip-flop technique. It works like this:

  1. Declare two arrays on the device, d_arr1 and d_arr2.
  2. Copy the data host -> device into one of the arrays.
  3. Pass pointers to d_arr1 and d_arr2 as kernel parameters.
  4. Process the data inside the kernel.
  5. On subsequent kernel calls, swap the pointers you pass as parameters.

This way you avoid transferring the data on every kernel call. You transfer only at the beginning and at the end of your host loop.

int a;
for (a = 0; a < 1000; a++)
{
  if (a % 2 == 0)
    //call to the kernel(pointer_a, pointer_b)
  else
    //call to the kernel(pointer_b, pointer_a)
}
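The five steps above can be fleshed out into a complete sketch. The kernel name (step_kernel), the size N, and the launch configuration are placeholders of mine; the structure (copy in once, swap pointers between launches, copy out once) is the flip-flop technique described above:

```cuda
int *d_arr1, *d_arr2;
cudaMalloc((void **)&d_arr1, N * sizeof(int));
cudaMalloc((void **)&d_arr2, N * sizeof(int));
cudaMemcpy(d_arr1, h_data, N * sizeof(int), cudaMemcpyHostToDevice);  // one transfer in

int *in = d_arr1, *out = d_arr2;
for (int a = 0; a < 1000; a++) {
    step_kernel<<<grid, block>>>(in, out);  // reads 'in', writes 'out'
    int *tmp = in; in = out; out = tmp;     // swap roles for the next call
}

// after the loop, 'in' points at the most recent result
cudaMemcpy(h_data, in, N * sizeof(int), cudaMemcpyDeviceToHost);      // one transfer out
```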