Is there any performance related reason behind this?
Yes, a doubly-subscripted array normally requires an extra pointer lookup, i.e. an extra memory read, before the data referenced can be accessed. By using "simulated" 2D access:
int val = d[i*columns+j];
instead of:
int val = d[i][j];
then only a single memory read access is required. The proper index is computed directly, rather than requiring the read of a row pointer. GPUs generally have lots of compute capability compared to memory bandwidth, so spending a multiply and an add to avoid a memory read is usually a good trade.
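As a rough sketch of the "simulated" 2D access in a kernel (the names scale and scale_on_device are made up for illustration, and error checking is omitted for brevity), something like this should work:

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each thread touches one element through the flattened index i*columns + j,
// so no row pointer has to be read before the data itself.
__global__ void scale(int *d, int rows, int columns, int factor)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < rows && j < columns)
        d[i * columns + j] *= factor;
}

// Hypothetical host helper: copies a rows x columns matrix stored as one
// contiguous block to the device, scales it, and copies it back.
std::vector<int> scale_on_device(const std::vector<int> &h, int rows, int columns, int factor)
{
    assert(h.size() == static_cast<size_t>(rows) * columns);
    size_t bytes = h.size() * sizeof(int);

    int *d = nullptr;
    cudaMalloc((void **)&d, bytes);  // a single allocation for the whole matrix
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((columns + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    scale<<<grid, block>>>(d, rows, columns, factor);

    std::vector<int> out(h.size());
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d);
    return out;
}
```

Because the whole matrix lives in one allocation, a single cudaMalloc and a single cudaMemcpy in each direction are enough.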
I don't understand what the author really means; actually, I find the above code correct
The code is in fact incorrect.
This operation:
cudaMalloc( (void**)&d_array , 5 * sizeof(int*) );
creates a single contiguous allocation on the device, large enough to store 5 pointers, and stores the starting address of that allocation in the host memory location associated with d_array. That is what cudaMalloc does: it creates a device allocation of the requested length, and stores the starting device address of that allocation in the provided host memory variable.
So let's deconstruct what is being asked for here:
cudaMalloc((void **)&d_array[i],10 * sizeof(int));
This says: create a device allocation of length 10*sizeof(int) and store the starting address of it in the location d_array[i]. But the location associated with d_array[i] is on the device, not the host, and requires dereferencing of the d_array pointer to actually access it, to store something there.
cudaMalloc does not do this. You cannot ask for the starting address of the device allocation to be stored in device memory. You can only ask for the starting address of the device allocation to be stored in host memory.
&d_array is a pointer to host memory.
&d_array[i] is a pointer to device memory.
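One common way around this, sketched below under some assumptions, is to stage the row pointers in a host array first, so that every cudaMalloc call writes into host memory, and then copy that array of pointers to the device. The names h_rows, fill, and build_and_read are invented for this illustration, the sizes 5 and 10 are taken from the calls above, and error checking is left out:

```cuda
#include <cassert>
#include <cuda_runtime.h>

const int ROWS = 5;
const int COLS = 10;

// Doubly-subscripted access on the device: d_array[i] is read first (the row
// pointer), then element j of that row.
__global__ void fill(int **d_array, int rows, int cols)
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    if (i < rows && j < cols)
        d_array[i][j] = i * cols + j;
}

// Hypothetical helper: builds the pointer-of-pointers structure, fills it on
// the device, and returns the element at (r, c).
int build_and_read(int r, int c)
{
    assert(r >= 0 && r < ROWS && c >= 0 && c < COLS);

    int *h_rows[ROWS];  // host array that will hold device row addresses
    for (int i = 0; i < ROWS; i++)
        cudaMalloc((void **)&h_rows[i], COLS * sizeof(int));  // &h_rows[i] is host memory

    int **d_array = nullptr;
    cudaMalloc((void **)&d_array, ROWS * sizeof(int *));      // &d_array is host memory
    // Copy the row pointers themselves into the device-side pointer array.
    cudaMemcpy(d_array, h_rows, ROWS * sizeof(int *), cudaMemcpyHostToDevice);

    fill<<<ROWS, COLS>>>(d_array, ROWS, COLS);

    int value = -1;
    // h_rows[r] + c is a device address computed on the host; it is never
    // dereferenced on the host, only handed to cudaMemcpy.
    cudaMemcpy(&value, h_rows[r] + c, sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < ROWS; i++)
        cudaFree(h_rows[i]);
    cudaFree(d_array);
    return value;
}
```

Note that this needs one allocation per row plus one for the pointer array, which is part of why the flattened single-allocation layout is usually simpler.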
The canonical 2D array worked example is now referenced in the cuda tag info link.