2 votes

I am training a CNN. The following error has appeared three times this week, always after a long run (e.g., 419140 steps).

Here is the relevant part of the log:

2017-09-15 11:16:03.515396: step 419120, loss = 0.30 (4427.4 examples/sec; 0.029 sec/batch)
2017-09-15 11:16:03.766922: step 419130, loss = 0.38 (5089.0 examples/sec; 0.025 sec/batch)
2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch)
2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-09-15 20:48:03.734133: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

If I restart the training, TensorFlow will not use the GPU. Here is the relevant log line:

2017-09-15 21:54:38.681074: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

To make the GPU work again, I have to restart my computer.
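
For reference, a quick way to confirm whether TensorFlow can still see the GPU after such a failure (without relaunching the whole training job) is something like the sketch below; check_gpu.py is just a placeholder name, and the check simply reuses the standard TF 1.x device listing:

# check_gpu.py -- minimal sketch (TF 1.x) to verify GPU visibility after a crash
from tensorflow.python.client import device_lib

def visible_gpus():
    # Lists every device TensorFlow can initialize; a healthy setup should
    # include at least one entry with device_type == 'GPU'.
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']

if __name__ == '__main__':
    gpus = visible_gpus()
    if gpus:
        print('GPUs visible to TensorFlow:', gpus)
    else:
        print('No GPU visible -- cuInit is probably still failing.')

In my case this prints the same CUDA_ERROR_UNKNOWN message until I reboot.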

The error appears to come from a C++ file I am not familiar with. Can someone give me some advice on how to debug or work around this error?
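
For now, the only mitigation I can think of is checkpointing frequently so a crash near step 419140 only loses a little work; a minimal sketch of what I mean is below (the checkpoint directory, the 60-second interval, and build_model are placeholders for my actual setup), but I would still like to understand the root cause:

# minimal sketch: checkpoint frequently so a crash only loses recent steps
import tensorflow as tf

def train(build_model):
    # A global step is required by the checkpoint saver hook.
    global_step = tf.train.get_or_create_global_step()
    # build_model is a stand-in that returns (train_op, loss).
    train_op, loss = build_model(global_step)
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir='/tmp/cnn_train',    # placeholder path
            save_checkpoint_secs=60) as sess:   # write a checkpoint every minute
        while not sess.should_stop():
            _, loss_value, step = sess.run([train_op, loss, global_step])

Restarting with the same checkpoint_dir resumes from the latest checkpoint automatically, but that does not explain the crash itself.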


2 Answers

1 vote

I faced the same problem and found a suggestion about why it happens here: https://devtalk.nvidia.com/default/topic/1046479/gpu-occasionally-gets-lost-when-running-tensorflow-/

Apparently, when an Nvidia GPU overheats it throws this error!
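
If you want to check whether heat is the trigger, you can log the GPU temperature alongside training with something like the sketch below; the nvidia-smi query flags are standard, but the script name and polling interval are just examples:

# log_gpu_temp.py -- sketch: periodically record GPU temperature via nvidia-smi
import subprocess
import time

def gpu_temperatures():
    # nvidia-smi prints one temperature (degrees C) per GPU with these flags.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=temperature.gpu',
         '--format=csv,noheader,nounits'])
    return [int(t) for t in out.decode().split()]

if __name__ == '__main__':
    while True:
        print(time.strftime('%Y-%m-%d %H:%M:%S'), gpu_temperatures())
        time.sleep(60)

If the crashes line up with temperature spikes, overheating is a plausible cause.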

0 votes

I hit the error again. This time I noticed a message saying "core dumped", but I forgot to save it. From my experience, the program (or Python, or the OS) should have saved some dump/log file for analysis. Any clue where I can find it?

I found the cause of this. The error occurs when I put my computer into suspend (S3): when the computer resumes from S3, the error appears. Maybe the CUDA driver does not support S3 on Linux yet. I will dig deeper on the official Nvidia website when I have time.
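
Until I confirm whether the driver handles S3 properly, my workaround is to keep the machine from suspending during training and to log when a suspend/resume happened so I can correlate it with the CUDA error. A rough sketch of the detection part (the script name, interval, and threshold are arbitrary choices of mine):

# suspend_watch.py -- sketch: notice that the machine was suspended while training
import time

CHECK_EVERY = 10      # seconds between checks (arbitrary)
GAP_THRESHOLD = 30    # extra wall-clock gap that suggests a suspend/resume

last = time.time()
while True:
    time.sleep(CHECK_EVERY)
    now = time.time()
    if now - last > CHECK_EVERY + GAP_THRESHOLD:
        # The loop slept far longer than requested -- most likely the whole
        # machine was in S3 for roughly (now - last) seconds.
        print('possible suspend/resume detected, gap = %.0f s' % (now - last))
    last = now

On a systemd-based distribution, masking the sleep targets (systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target) should keep the box awake until the driver issue is sorted out.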