tensorflow throw CUDA_ERROR_LAUNCH_FAILED after a long run

Question

I am training a CNN. the following error appear 3 time in this week. they all appear after a long run ( eg, 419140 steps ).

here is the partial log:

2017-09-15 11:16:03.515396: step 419120, loss = 0.30 (4427.4 examples/sec; 0.029 sec/batch) 2017-09-15 11:16:03.766922: step 419130, loss = 0.38 (5089.0 examples/sec; 0.025 sec/batch) 2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch) 2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED 2017-09-15 20:48:03.734133: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

If I restart the training, tensorflow will not utilize the GPU, here is the relevant log:

2017-09-15 21:54:38.681074: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

To make GPU work again, I have to restart my computer.

It appears the error happened in a c++ file which I am not familiar. Can some one give me some advice about how to debug or workaround this error?

sniper71 sniper71 · Accepted Answer · 2019-09-26T08:27:24

I faced the same problem and I found a suggestion on why it's happening here : https://devtalk.nvidia.com/default/topic/1046479/gpu-occasionally-gets-lost-when-running-tensorflow-/

Apparently, when Nvidia GPU overheats it throws this error!

tensorflow throw CUDA_ERROR_LAUNCH_FAILED after a long run

2 Answers