CUDA kernels consistently returning bad results

Question

I am a CUDA beginner who has successfully compiled and run several code samples using CUDA libraries such as CUFFT and CUBLAS. Lately, however, I have been trying to generate my own simple kernels and am repeatedly receiving nonsense values back after calling my kernels. That is--when I pass a parameter into a kernel, set its value in the kernel, then try to copy the results back to the host and read the values later, they are bogus. I have tried many different simple tutorial kernels that seem to work for most people online, but I always get nonsensical values. For example...

    #include <stdio.h>

    #define SIZE 10

    //  Kernel definition, see also section 4.2.3 of Nvidia Cuda Programming Guide                                      
    __global__  void vecAdd(float* A, float* B, float* C) {

      // threadIdx.x is a built-in variable  provided by CUDA at runtime                                                
      int i = threadIdx.x;
      A[i]=0;
      B[i]=i;
      C[i] = A[i] + B[i];

    }

    int main() {

      int N=SIZE;
      float A[SIZE], B[SIZE], C[SIZE];
      float *devPtrA;
      float *devPtrB;
      float *devPtrC;
      int memsize= SIZE * sizeof(float);

      cudaMalloc((void**)&devPtrA, memsize);
      cudaMalloc((void**)&devPtrB, memsize);
      cudaMalloc((void**)&devPtrC, memsize);
      cudaMemcpy(devPtrA, A, memsize,  cudaMemcpyHostToDevice);
      cudaMemcpy(devPtrB, B, memsize,  cudaMemcpyHostToDevice);
      // __global__ functions are called:  Func<<< Dg, Db, Ns >>>(parameter);                                          
      vecAdd<<<1, N>>>(devPtrA,  devPtrB, devPtrC);
      cudaMemcpy(C, devPtrC, memsize,  cudaMemcpyDeviceToHost);

      for (int i=0; i<SIZE; i++)
        printf("C[%d]=%f\n",i,C[i]);

      cudaFree(devPtrA);
      cudaFree(devPtrB);
      cudaFree(devPtrC);

}

This is a fairly straightforward problem; the results should be:

C[0]=0.000000 
C[1]=1.000000 
C[2]=2.000000 
C[3]=3.000000 
C[4]=4.000000 
C[5]=5.000000 
C[6]=6.000000 
C[7]=7.000000 
C[8]=8.000000 
C[9]=9.000000

However, my awesome results are always random and generally look more like:

C[0]=nan
C[1]=-32813464158208.000000
C[2]=nan
C[3]=-27667211200843743232.000000
C[4]=34559834084263395806523272811251761152.000000
C[5]=9214363188332593152.000000
C[6]=nan
C[7]=-10371202300694685655937271382147072.000000
C[8]=121653576586393934243511643668480.000000
C[9]=-30648783863808.000000

So basically, when I pass parameters into a CUDA kernel with the intention of storing results in them to be copied back to the host, I tend to get junk back.

This one really has me stumped. Any help would be greatly appreciated.

Thanks.

You should probably be bracketing your calls to cudaMalloc and cudaMemcpy with CUDA_SAFE_CALL as currently you're not doing any error checking. — Paul R
I just added -deviceemu parameter to NVCC and received the warning, "option 'device-emulation' has been deprecated and is ignored". Googling suggests emulation is no longer supported, or am I going about it incorrectly? — nedblorf
@nedblorf: that's a shame - it's been a while since I did any CUDA work but emulation mode always used to be a useful debugging technique. I think I heard that the latest CUDA tools now include a debugger ? You may have to try that instead. — Paul R
BTW, just out of curiosity, do you get garbage in A and B as well ? — Paul R

Tom Tom · Accepted Answer · 2011-02-15T12:07:30

You should always check for errors returned by API calls. C developers are accustomed to checking for NULL from malloc(), for example, since not checking for NULL frequently results in a null pointer dereference later on (bad things ensue). C++ developers often rely on exceptions, but many APIs are C-style (including the CUDA calls you are using and many other libraries), so you should be aware of when to check for errors.

Ideally you would check for errors on every API call. Personally, I wouldn't use the SAFE_CALL macros from CUTIL; instead, I would check the error, handle it properly, and throw an exception (C++) or at least clean up properly. That way, when you grow the experiment into a bigger app, you have already thought about error handling.
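The check-and-throw approach described above could look roughly like the sketch below. The helper name throwOnCudaError and its `where` parameter are made up for illustration; they are not part of CUDA or CUTIL. Only cudaError_t, cudaSuccess and cudaGetErrorString come from the CUDA runtime.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CUDA or CUTIL).
// Throws std::runtime_error if `result` is anything other than cudaSuccess.
// `where` is a caller-supplied description of the call being checked.
inline void throwOnCudaError(cudaError_t result, const char* where) {
    if (result != cudaSuccess) {
        throw std::runtime_error(std::string(where) + ": " +
                                 cudaGetErrorString(result));
    }
}
```

A call site would then wrap each API call, for example `throwOnCudaError(cudaMalloc((void**)&devPtrA, memsize), "cudaMalloc devPtrA");`, and the caller can catch the exception wherever cleanup makes sense.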

At the very least, you should check for an error at the end:

cudaError_t cudaResult;
cudaResult = cudaGetLastError();
if (cudaResult != cudaSuccess)
{
    // Do whatever you want here
    // I normally create a std::string msg with a description of where I am
    // and append cudaGetErrorString(cudaResult)
}
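Applied to the program in the question, per-call checking might look something like the sketch below. The helper names checkCuda and runVecAdd are assumptions made for this example, and the host-to-device copies of the uninitialized A and B arrays are left out because the kernel overwrites both. A kernel launch does not return a status, so the sketch queries cudaGetLastError() right after the launch and then cudaDeviceSynchronize() to pick up errors raised while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 10

// Hypothetical helper: reports a failed call on stderr and returns false.
static bool checkCuda(cudaError_t result, const char* what) {
    if (result != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(result));
        return false;
    }
    return true;
}

// Same kernel as in the question.
__global__ void vecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

// Hypothetical wrapper: runs the kernel and copies C back into hostC.
// Stops at the first failing call and returns false in that case.
static bool runVecAdd(float* hostC) {
    float *devA = NULL, *devB = NULL, *devC = NULL;
    const size_t memsize = SIZE * sizeof(float);

    bool ok = checkCuda(cudaMalloc((void**)&devA, memsize), "cudaMalloc A")
           && checkCuda(cudaMalloc((void**)&devB, memsize), "cudaMalloc B")
           && checkCuda(cudaMalloc((void**)&devC, memsize), "cudaMalloc C");
    if (ok) {
        vecAdd<<<1, SIZE>>>(devA, devB, devC);
        // The launch itself returns nothing, so ask for its status explicitly.
        ok = checkCuda(cudaGetLastError(), "vecAdd launch")
          && checkCuda(cudaDeviceSynchronize(), "vecAdd execution")
          && checkCuda(cudaMemcpy(hostC, devC, memsize, cudaMemcpyDeviceToHost),
                       "cudaMemcpy C");
    }
    // cudaFree on a null pointer does nothing, so this is safe after a partial failure.
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    return ok;
}
```

From main, you would declare `float C[SIZE];`, call `runVecAdd(C)`, and print the array only if it returns true. If the very first cudaMalloc already fails, the message on stderr points at the setup (driver, device) rather than at the kernel.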
