I am training a CNN. The following error has appeared three times this week, each time after a long run (e.g., 419140 steps).
Here is the partial log:
2017-09-15 11:16:03.515396: step 419120, loss = 0.30 (4427.4 examples/sec; 0.029 sec/batch)
2017-09-15 11:16:03.766922: step 419130, loss = 0.38 (5089.0 examples/sec; 0.025 sec/batch)
2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch)
2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-09-15 20:48:03.734133: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
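One thing I noticed in this log is the long gap between the last logged step and the first error line. Here is a minimal sketch of how I measured it, assuming every line starts with a timestamp in the format shown above. The names `parse` and `gap_before_first_error` are my own helpers, not part of TensorFlow.

```python
import re
from datetime import datetime

# Two lines copied from the log above: the last step and the first error.
LOG = """\
2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch)
2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
"""

# Timestamp prefix, then ": ", then the rest of the message.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}): (.*)$")


def parse(line):
    """Return (timestamp, message) for a log line, or None if it does not match."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return when, match.group(2)


def gap_before_first_error(log_text):
    """Time between the last 'step' line and the first E/F line that follows it."""
    last_step_time = None
    for line in log_text.splitlines():
        parsed = parse(line)
        if parsed is None:
            continue
        when, message = parsed
        if message.startswith("step "):
            last_step_time = when
        elif message.startswith(("E ", "F ")) and last_step_time is not None:
            return when - last_step_time
    return None


gap = gap_before_first_error(LOG)
print(gap)
```

For this run the gap is roughly nine and a half hours, so no step was logged for a long time before the error was reported.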
If I restart the training, TensorFlow does not use the GPU. Here is the relevant log:
2017-09-15 21:54:38.681074: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
To make the GPU work again, I have to restart my computer.
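To avoid silently relaunching on the CPU, a check like the following could be run before restarting training. This is only a rough sketch: it assumes `nvidia-smi` is on the PATH, and `gpu_is_visible` is a hypothetical helper name. A successful `nvidia-smi -L` only shows that the driver responds; it does not guarantee that `cuInit` will succeed inside TensorFlow.

```python
import subprocess


def gpu_is_visible(command=("nvidia-smi", "-L"), timeout_sec=30):
    """Return True if the command exits cleanly and lists at least one GPU.

    `nvidia-smi -L` prints one line per device, starting with "GPU <index>:".
    """
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_sec,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the driver is hanging.
        return False
    return result.returncode == 0 and b"GPU " in result.stdout


if __name__ == "__main__":
    if gpu_is_visible():
        print("GPU responds; safe to try relaunching training.")
    else:
        print("GPU not visible; a reboot or driver reload is probably needed.")
```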
The error appears to come from a C++ file, which I am not familiar with. Can someone give me some advice on how to debug or work around this error?