My question is: is the CUDA hardware faulty, or is there another explanation? I have a kernel that has been in use for about a year without modification. Recently, I started getting segmentation faults at irregular intervals: the fault could be reproduced, but sometimes only after a few minutes, sometimes after hours of execution. This led me to a bare-minimum version of the program that still reproduces the segfault, along with much learning from Stack Overflow posts.
cuda-memcheck, when run repeatedly in a bash loop, will eventually report:
========= Invalid __global__ read of size 4
========= at 0x000048f0 in SegFault.cu:157:SegFault( float* )
========= by thread (128,0,0) in block (3706,0,0)
========= Address 0x003400e8 is out of bounds
The usual culprit of bad pointer operations was ruled out. Another clue was that the illegal addressing was not consistent in where it occurred: it showed up irregularly, at reads of various indices into a global array, throughout the kernel.
At this point, the most likely explanation is still buggy code. What leads me to believe the hardware is faulty comes from cuda-gdb:
cuda-gdb ./SegFaultTest
(cuda-gdb) set cuda memcheck on
(cuda-gdb) run
Illegal access to address (@global)0x245684 detected.
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (5537,0,0), thread (0,0,0), device 0, sm 22, warp 28, lane 0]
0x00000000004f1ff8 in kernel( float * @global )<<<(33480,1,1),(512,1,1)>>> ( c=0x250000 ) at SegFault.cu:37
37 c[ix] += share_c[0];
(cuda-gdb) print &c[ix]
$2 = (@global float *) 0x255684
The index "ix" is:
int ix = blockIdx.x + blockIdx.y*gridDim.x;
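For context, the code around the failing statement has roughly this shape. This is only a minimal sketch: the share_c size, the reduction loop, and the threadIdx.x == 0 guard are my assumptions for illustration, not the actual SegFault.cu, but it shows how ix and share_c are used at the line cuda-gdb stops on.

__global__ void kernel( float *c )
{
    // One partial result per thread, reduced within the block (assumed).
    __shared__ float share_c[512];

    // One output element per block; ix depends only on the block indices.
    int ix = blockIdx.x + blockIdx.y*gridDim.x;

    share_c[threadIdx.x] = 0.0f;   // ... per-thread work would accumulate here ...
    __syncthreads();

    // Hypothetical tree reduction into share_c[0].
    for ( int s = blockDim.x/2; s > 0; s >>= 1 )
    {
        if ( threadIdx.x < s )
            share_c[threadIdx.x] += share_c[threadIdx.x + s];
        __syncthreads();
    }

    // SegFault.cu:37 -- the statement where cuda-gdb reports the illegal address.
    // Only thread 0 of each block writes, matching thread (0,0,0) in the report.
    if ( threadIdx.x == 0 )
        c[ix] += share_c[0];
}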
It is not modified after it is initialized. Indeed, 0x245684 is below the starting address of c = 0x250000. Yet when I query print &c[ix], it returns 0x255684, which is an acceptable address for this array (with c = 0x250000 and ix = 5537, &c[ix] = 0x250000 + 5537*4 = 0x255684, exactly as expected). Reproducing the error takes 10-50 executions before it pops up again, but the illegal address always differs by exactly one bit, 0x010000, from what print &c[ix] returns. I can't explain the difference in address between the error message and print. Combined with the one-bit difference, I suspect faulty hardware. FWIW, 0x010000 is equal to the maximum grid size for this Tesla C1060.
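To make the arithmetic concrete, here is a trivial host-side check with the values copied from the cuda-gdb session above (nothing here depends on the GPU):

#include <stdio.h>

int main( void )
{
    const unsigned long base    = 0x250000;  /* c, as reported by cuda-gdb       */
    const unsigned long printed = 0x255684;  /* result of: print &c[ix]          */
    const unsigned long faulted = 0x245684;  /* address in the exception message */

    /* Element index implied by the printed address. */
    unsigned long ix = (printed - base) / sizeof(float);
    printf( "ix   = %lu\n", ix );                  /* 5537, matches block (5537,0,0) */

    /* The faulting address differs from the printed one by a single bit. */
    printf( "diff = 0x%lx\n", printed ^ faulted ); /* 0x10000, i.e. bit 16 flipped   */
    return 0;
}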
And finally, I replaced the CUDA card today with a new model. I have not been able to reproduce the error after 100 executions.