1 vote

This is a simple test program I have been working on (to help debug my work on a running-sum function), and I just cannot seem to find what's wrong. The program simply calls my running-sum function on a small list and attempts to print out the data. The line that's causing all the trouble is the one that's commented out: the `cudaMemcpy(DeviceToHost)`. When that line is part of the code, the error I get is:

CUDA error at: student_func.cu:136
unspecified launch failure cudaGetLastError()
terminate called after throwing an instance of 'thrust::system::system_error'
  what(): unload of CUDA runtime failed

I simply do not know what's wrong with this, and it's driving me insane. I tried using plain old malloc with the same result. I have confirmed that the input data gets copied over to the device array fine (by printing in the kernel), but I am simply not able to copy the results back from device to host. I would really appreciate any help whatsoever! Thanks in advance :)

unsigned int numElems = 100;
unsigned int blockLength = min( (unsigned int) 1024, (unsigned int) numElems);
unsigned int gridLength = ceil ( (float) numElems / (float) blockLength );

unsigned int* d_in;

unsigned int* h_in;
checkCudaErrors(cudaMallocHost(&h_in, sizeof(unsigned int) * numElems));

for (int i = 0; i < numElems; i++)
{
   h_in[i] = i;
}

checkCudaErrors(cudaMalloc(&d_in, sizeof(unsigned int) * numElems));
checkCudaErrors(cudaMemcpy(d_in, h_in, sizeof(unsigned int) * numElems, cudaMemcpyHostToDevice));

exclusive_running_sum<<< gridLength, blockLength >>>(d_in, d_in, numElems);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

//this line is a problem!!
//checkCudaErrors(cudaMemcpy(h_in, d_in, sizeof(unsigned int) * numElems, cudaMemcpyDeviceToHost));

for (int i = 0; i < numElems; i++)
{
    printf("%i %i\n", i, h_in[i]);
}
The cudaMemcpy call isn't the problem. Your kernel is failing somehow. - talonmies
I've found that an unspecified launch failure is sometimes caused by bad block or grid sizes. Try changing them to, say, 1 and seeing if it runs. Also, are they OK as ints? It's been a long time since I've CUDAd. - P O'Conbhui
@PO'Conbhui: No; illegal block or grid dimensions will produce a cudaErrorInvalidConfiguration error in the runtime API. - talonmies
Thanks for the replies. I have confirmed that the kernel works fine when that line is commented out. I even tried printing out the results (from the kernel itself) and they are fine. It's definitely the copy that's causing this. I will try playing with the array size and get back. - Rajiv Nair
@RajivNair: Could you edit your question to include the shortest complete version of the code that someone else could compile and run themselves? The error you are seeing is really being generated by the kernel; it is a combination of imperfect error checking and inexperience that is leading you to an incorrect diagnosis of the problem. But we can't help you without seeing a complete example which replicates the problem. - talonmies

1 Answer

1 vote

Thanks to everyone for the help. I have found the bug. After much debugging, I realized that I had (very foolishly) forgotten that I was using externally allocated shared memory within the kernel.
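For anyone hitting the same symptom: the asker's kernel is not shown, but a kernel that declares `extern __shared__` storage receives its size only through the third argument of the launch configuration. If that argument is omitted, the kernel reads and writes unallocated shared memory, and the resulting corruption can surface as an "unspecified launch failure" on a later CUDA call, such as the `cudaMemcpy` here. A minimal sketch of the mistake and the fix (the kernel body is illustrative, not the asker's actual scan):

```cuda
__global__ void exclusive_running_sum(unsigned int* d_out,
                                      const unsigned int* d_in,
                                      unsigned int numElems)
{
    // Externally allocated shared memory: its size is NOT known here;
    // it must be supplied at launch time.
    extern __shared__ unsigned int s_data[];

    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

    s_data[tid] = (gid < numElems) ? d_in[gid] : 0;
    __syncthreads();

    // ... scan over s_data ...

    if (gid < numElems)
        d_out[gid] = s_data[tid];
}

// Wrong: no shared-memory size given, so s_data is effectively unallocated:
// exclusive_running_sum<<<gridLength, blockLength>>>(d_in, d_in, numElems);

// Right: pass the dynamic shared-memory size as the third launch argument:
// exclusive_running_sum<<<gridLength, blockLength,
//                         blockLength * sizeof(unsigned int)>>>(d_in, d_in, numElems);
```

Because kernel launches are asynchronous, an error like this is only reported by a later synchronizing call, which is why it can appear to come from the `cudaMemcpy` line rather than the launch itself.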