4
votes

cuda-memcheck is reporting this information for a release mode CUDA kernel:

========= Error: process didn't terminate successfully
========= Invalid __global__ read of size 4
=========     at 0x000002c8 in xx_kernel
=========     by thread (0,0,0) in block (0,0)
=========     Address 0x10101600014 is out of bounds
=========
========= ERROR SUMMARY: 1 error

This fault only happens in release mode. It also doesn't happen when running under cuda-gdb.

How can I take the 0x000002c8 address and determine the code that is causing the fault? I've looked through the cached intermediate files (.ptx, .cubin, etc) and see no obvious way to determine the faulty source code.

This is on x86_64 Linux with CUDA 3.2.

UPDATE: Turns out it was a compiler bug in 3.2. Upgrading to 4.0 makes the memcheck error go away. Also, I was able to disassemble the CUBIN with the cuobjdump from 4.0, but since it was release mode and optimized, it was exceedingly difficult to match the disassembly to the source code.

2
Can you post your kernel code so that we can see why this thread accesses an out of bound area?jopasserat
Unfortunately its proprietary source code so I can't post the actual code. Thanks.dwelch91

2 Answers

7
votes

Download the CUDA Toolkit 4.0 from the NVIDIA Developer Zone. Use the new cuobjdump that supports 2.x cubins.

cuobjdump -sass /path/to/your/cubin > /path/to/dump.txt.

Example output (tested on a sm_20 cubin, code version 2.3)

    ...
/*6018*/     /*0xe00100075003ff9a*/     CAL 0x46d8;
/*6020*/     /*0x10001de428000000*/     MOV R0, R4;
/*6028*/     /*0x00001de428000000*/     MOV R0, R0;
/*6030*/     /*0x40011de428000000*/     MOV R4, R16;
    ...
4
votes

This kind of errors within a kernel is tied to a memory access which is not only based upon the thread identifier.

Considering that every memory area you use has been correctly allocated for the GPU, access based only upon something like threadIdx.x shouldn't cause any problem. Thus:

  • either you have a wrong index calculation (it is frequent with expressions like data[blockDim.y * blockDim.x * threadIdx.z + blockDim.x * threadIdx.y + threadIdx.x] for instance)
  • or you use another variable in your index calculation which make it exceed your array bounds (for example data[threadIdx.x + offset])

---- edit (following comments) ----
See @Cicada's answer for the complement on cuobjdump for device > 2.x