3
votes

I was looking to use system functions (such as rand()) within a CUDA kernel. However, ideally this would just run on the CPU. Can I split the code into separate files (.cu and .cpp) while still making use of GPU matrix addition? For example, something along these lines:

in main.cpp:

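#include <cstdlib>
#include <ctime>
#include <vector>

// Defined in cudaFuncs.cu
void selfSquare(std::vector<int> arr, int n);

int main(){
    std::vector<int> myVec;
    srand(time(NULL));

    // Fill the vector on the CPU with values in [0, 26)
    for (int i = 0; i < 1024; i++){
        myVec.push_back(rand() % 26);
    }

    selfSquare(myVec, 1024);
}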

and in cudaFuncs.cu:

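#include <vector>

// One thread per element; a single block limits n to 1024 on current GPUs
__global__ void selfSquare_cu(int *arr, int n){
    int i = threadIdx.x;
    if (i < n){
        arr[i] = arr[i] * arr[i];
    }
}

void selfSquare(std::vector<int> arr, int n){
    int *cuArr;
    cudaMallocManaged(&cuArr, n * sizeof(int));
    // Extra step: copy the host vector into managed memory
    for (int i = 0; i < n; i++){
        cuArr[i] = arr[i];
    }

    selfSquare_cu<<<1, n>>>(cuArr, n);
    cudaDeviceSynchronize();  // wait for the kernel to finish
    cudaFree(cuArr);
}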

What are the best practices for situations like this? Would it be better to use cuRAND and write everything in the kernel? It looks to me like the above example has an extra step: copying the vector into CUDA managed memory.

1
The less communication, usually the better. Lots of small things can be done on the GPU at a lower cost than transferring data back and forth. Also, you shouldn't use rand to get random numbers. - Matthieu Brucher

1 Answer

2
votes

In this case, the only thing you need is an array initialised with random values. Each value of the array can be initialised independently. In your code, the CPU is involved in initialising the data and in transferring it to the device and back to the host.

Do you really need the CPU to initialise the data, only to then move all those values to the GPU?

The best approach is to allocate some device memory and then initialise the values using a kernel. This will save time because:

  • The elements are initialised in parallel
  • No memory transfer from the host to the device is required

As a rule of thumb, always avoid communication between host and device if possible.
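
As a rough illustration of initialising on the device, here is a minimal sketch using the cuRAND device API. The names initAndSquare and randomSquares, the seed parameter, and the block size of 256 are assumptions made for this example and do not come from the question. Error checking is left out for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread draws its own value in [0, 26) and stores its square.
__global__ void initAndSquare(int *arr, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState state;
        // Same seed, different subsequence per thread
        curand_init(seed, i, 0, &state);
        int v = curand(&state) % 26;
        arr[i] = v * v;
    }
}

// Hypothetical host wrapper; a .cpp file only needs its declaration.
std::vector<int> randomSquares(int n, unsigned long long seed) {
    assert(n > 0);

    int *dArr = nullptr;
    cudaMalloc(&dArr, n * sizeof(int));

    int threads = 256;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    initAndSquare<<<blocks, threads>>>(dArr, n, seed);

    // The only transfer: results come back once, device to host.
    // cudaMemcpy waits for the kernel on the default stream.
    std::vector<int> out(n);
    cudaMemcpy(out.data(), dArr, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dArr);
    return out;
}
```

Calling randomSquares(1024, 1234ULL) from main.cpp would then replace both the rand() loop and the copy into managed memory. The block should compile with nvcc alone, since curand_kernel.h provides the device-side generator as a header.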