3
votes

The CUDA programming guide states that "Memory allocated via malloc() can be copied using the runtime (i.e., by calling any of the copy memory functions from Device Memory)", but somehow I'm having trouble reproducing this functionality. Code:

#include <cstdio>
#include <cstdlib>

__device__ int* p;

__global__ void allocate_p() {
  p = (int*) malloc(40);  // allocated on the device runtime heap
  printf("p = %p  (seen by GPU)\n", p);
}

int main() {
  cudaError_t err;
  int* localp = (int*) malloc(40);

  allocate_p<<<1,1>>>();
  cudaDeviceSynchronize();

  // Fetch the device-heap pointer stored in the device symbol p
  int* tmpp = NULL;
  cudaMemcpyFromSymbol(&tmpp, p, sizeof(int*));
  printf("p = %p  (seen by CPU)\n", tmpp);

  //cudaMalloc((void**)&tmpp, 40);
  err = cudaMemcpy(tmpp, localp, 40, cudaMemcpyHostToDevice);
  cudaDeviceSynchronize();
  printf(" err:%i %s\n", (int)err, cudaGetErrorString(err));

  free(localp);
  return 0;
}

crashes with output:

p = 0x601f920  (seen by GPU)
p = 0x601f920  (seen by CPU)
 err:11 invalid argument

I gather that the host sees the correct address on the device, but somehow does not accept it because it comes from malloc().

If I instead allocate beforehand with cudaMalloc((void**)&np, 40); and pass the pointer np as an argument to the kernel allocate_p, where it is assigned to p (instead of calling malloc()), then the code runs fine.
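For reference, the working variant described above looks roughly like this (a sketch only; the name np follows the question, and error checking is omitted):

```cuda
#include <cstdio>
#include <cstdlib>

__device__ int* p;

// The kernel receives a cudaMalloc'd pointer and stores it in the device symbol
__global__ void allocate_p(int* np) {
  p = np;
  printf("p = %p  (seen by GPU)\n", p);
}

int main() {
  int* localp = (int*) malloc(40);
  int* np = NULL;

  cudaMalloc((void**)&np, 40);   // allocated via the host API, not device malloc()
  allocate_p<<<1,1>>>(np);
  cudaDeviceSynchronize();

  // This copy succeeds because np comes from cudaMalloc, not the device heap
  cudaError_t err = cudaMemcpy(np, localp, 40, cudaMemcpyHostToDevice);
  printf(" err:%i %s\n", (int)err, cudaGetErrorString(err));

  cudaFree(np);
  free(localp);
  return 0;
}
```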

What am I doing wrong / how can malloc()-allocated device memory be used in host-side functions?

1
The section you are quoting is section "B.17.2 Interoperability with Host Memory API". malloc() there refers to host malloc(), not device malloc(). This is definitely confusing, given that it follows section "B.17.1 Heap Memory Allocation", which introduces device malloc(). – Greg Smith
@GregSmith: I can't understand how you reached that interpretation. The section is about device dynamic memory allocation (there are even code samples showing kernel malloc calls), and the reference is to host memory API functions described in Section 3.2.2, ie. cudaMemcpy. Where does host malloc come into the picture? I wouldn't say it is confusingly written, I would say it is just plain wrong (or there is a massive bug in the runtime as the alternative). – talonmies
@talonmies Yes, the section is wrong in older manuals and was copied from a different location. The statement should be removed from the next version of the programming guide. – Greg Smith

1 Answer

3
votes

As far as I am aware, it isn't possible to copy runtime heap memory using the host API functions. It certainly was not possible in CUDA 4.x, and the CUDA 5.0 release candidate has not changed this. The only workaround I can offer is to use a kernel to "gather" final results and stuff them into a device transfer buffer or zero-copy memory, which can be accessed via the API or directly from the host. You can see an example of this approach in this answer, and in another question Mark Harris from NVIDIA confirmed that this is a limitation of the (then) current implementation in the CUDA runtime.
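The "gather" workaround can be sketched as follows (names like heap_data and gather are illustrative, not from any API; error checking is omitted):

```cuda
#include <cstdio>

__device__ int* heap_data;

// Produce results in runtime-heap memory, invisible to the host API
__global__ void produce(int n) {
  heap_data = (int*) malloc(n * sizeof(int));
  for (int i = 0; i < n; ++i) heap_data[i] = i * i;
}

// "Gather" kernel: copy heap results into a buffer the host API can reach
__global__ void gather(int* transfer, int n) {
  for (int i = 0; i < n; ++i) transfer[i] = heap_data[i];
  free(heap_data);  // runtime-heap memory can be freed from device code
}

int main() {
  const int n = 10;
  int host[n];
  int* transfer = NULL;
  cudaMalloc((void**)&transfer, n * sizeof(int));  // API-visible buffer

  produce<<<1,1>>>(n);
  gather<<<1,1>>>(transfer, n);

  // This copy is legal because transfer comes from cudaMalloc
  cudaMemcpy(host, transfer, n * sizeof(int), cudaMemcpyDeviceToHost);

  for (int i = 0; i < n; ++i) printf("%d ", host[i]);
  printf("\n");
  cudaFree(transfer);
  return 0;
}
```

The same idea works in reverse for input data: copy into a cudaMalloc'd staging buffer with the host API, then scatter it into heap memory with a kernel.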