
In the CUDA examples I have read, I don't see any direct use of 2D array notation ([][]) in kernel code when the array resides in global memory, unlike when it is in shared memory (e.g. in matrix multiplication). Is there any performance-related reason behind this?

Also, I read in an old thread that the following code is incorrect:

int **d_array;
cudaMalloc((void**)&d_array, 5 * sizeof(int*));

for (int i = 0; i < 5; i++)
{
    cudaMalloc((void**)&d_array[i], 10 * sizeof(int));
}

According to the author, "once the main thread assigns memory on the device the main thread loses access to it, that is, it can only be accessed within kernels. So, When you try call cudaMalloc on the 2nd dimension of the array it throws an "Access violation writing location" exception."

I don't understand what the author really means; in fact, I find the above code correct.

Thank you for your help

SS


1 Answer


Is there any performance related reason behind this?

Yes, a doubly-subscripted array normally requires an extra pointer lookup, i.e. an extra memory read, before the data referenced can be accessed. By using "simulated" 2D access:

int val = d[i*columns+j];

instead of:

int val = d[i][j];

then only a single memory read is required. The proper offset is computed directly (one multiply-add), rather than requiring the read of a row pointer first. GPUs generally have abundant arithmetic capability relative to memory bandwidth, so trading an extra memory access for a little index arithmetic is almost always a win.
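As a sketch (the kernel name and dimensions here are illustrative, not from the question), a kernel using this flattened row-major indexing might look like:

```cuda
// Illustrative sketch: increment every element of a "2D" matrix that is
// stored as one contiguous row-major allocation in global memory.
__global__ void inc2d(int *d, int rows, int cols)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index

    if (i < rows && j < cols)
        d[i * cols + j] += 1;   // single access, no row-pointer chase
}
```

The host side is correspondingly simple: one cudaMalloc of rows*cols*sizeof(int) and one cudaMemcpy, with no per-row allocations at all.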

I don't understand what the author really means; actually, i find the above code correct

The code is in fact incorrect.

This operation:

cudaMalloc((void**)&d_array, 5 * sizeof(int*));

creates a single contiguous allocation on the device, large enough to hold 5 pointers, and stores the starting address of that allocation in the host variable d_array. That is what cudaMalloc does: it creates a device allocation of the requested size and stores the starting device address of that allocation in the provided host memory location.

So let's deconstruct what is being asked for here:

cudaMalloc((void**)&d_array[i], 10 * sizeof(int));

This asks cudaMalloc to create a device allocation of length 10*sizeof(int) and store its starting address in the location d_array[i]. But d_array[i] lives on the device, not the host, and reaching it requires dereferencing the device pointer d_array, which host code cannot legally do.

cudaMalloc does not do this. You cannot ask for the starting address of the device allocation to be stored in device memory. You can only ask for the starting address of the device allocation to be stored in host memory.

&d_array

is a pointer to host memory.

&d_array[i]

is a pointer to device memory.
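One common workaround (a sketch of one approach, not the only one) is to collect the row addresses in a host-side staging array first, then copy that array of pointers to the device in a single cudaMemcpy:

```cuda
int **d_array;    // device pointer to the device-resident array of row pointers
int  *h_rows[5];  // host-side staging array holding device row addresses

// Allocate each row on the device; cudaMalloc writes each device
// address into host memory (h_rows[i]), which is legal.
for (int i = 0; i < 5; i++)
    cudaMalloc((void**)&h_rows[i], 10 * sizeof(int));

// Allocate the device array of pointers, then copy the row addresses into it.
cudaMalloc((void**)&d_array, 5 * sizeof(int*));
cudaMemcpy(d_array, h_rows, 5 * sizeof(int*), cudaMemcpyHostToDevice);
```

Kernels can then use d_array[i][j], at the cost of the extra pointer dereference described above; flattening everything into a single allocation is usually both simpler and faster.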

The canonical 2D array worked example is now referenced in the cuda tag info link.