
I'm trying to train an SSD MobileNet v2 using the TensorFlow Object Detection API, with TensorFlow GPU. Training runs well and fast until the first checkpoint save (after a few hundred steps), where it gets stuck after restoring the last checkpoint. GPU usage drops and never recovers. Sometimes Python itself crashes.

I'm running TensorFlow GPU on Windows 7, with an NVIDIA Quadro M4000 and CUDA 8.0 (the only version I managed to get working). The model is an SSD MobileNet v2 pretrained on COCO, using a very low batch size of 4.

The config file is the one that ships with the TensorFlow Model Zoo; I only changed the paths, batch size, number of classes and number of steps, and added shuffle: true in the training section.

I'm adding the terminal output below. This is where it gets stuck.

Has anyone experienced the same kind of problem, or does anyone have an idea why it happens?

Thanks in advance

[Image: terminal output at the point where training gets stuck]

Did you ever figure this out? I have the same problem. - Atle Kristiansen
@AtleKristiansen Actually I didn't. I tried hard to find a solution but couldn't. Basically the issue is that model_main.py restores the parameters every time a checkpoint is saved, and that is when training gets stuck. The only solution I found is to use the legacy scripts train.py and eval.py under the legacy folder of the Object Detection API: train.py doesn't restore from the saved checkpoint every time. - Gian Mauro Musso
Yeah, that is what I figured as well. model_main.py does not work, and I find it quite odd that more people aren't reporting it. Maybe you should post your solution as an answer here, to make it easier for someone else to find a workaround for the problem. - Atle Kristiansen
I just ran into a similar problem and using train.py worked. Thanks. - ARK4579
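
For anyone who wants to try the legacy-script workaround described in the comments above, here is a minimal sketch of how the train.py command line could be assembled. The helper name `build_legacy_train_command` and the example paths are hypothetical, and the script location assumes you run from the `models/research` directory; adjust it to your checkout. The flags shown (`--logtostderr`, `--train_dir`, `--pipeline_config_path`) are the ones the legacy script is normally invoked with.

```python
import sys


def build_legacy_train_command(pipeline_config_path, train_dir,
                               script="object_detection/legacy/train.py"):
    """Return the argument list for launching the legacy training script.

    The default script path is an assumption about the repository layout.
    """
    return [
        sys.executable,
        script,
        "--logtostderr",
        "--train_dir=" + train_dir,
        "--pipeline_config_path=" + pipeline_config_path,
    ]


if __name__ == "__main__":
    # Placeholder paths, replace them with your own.
    cmd = build_legacy_train_command("path/to/pipeline.config",
                                     "path/to/train_dir")
    print(" ".join(cmd))
    # To actually start training, uncomment the following lines:
    # import subprocess
    # subprocess.run(cmd, check=True)
```

The sketch only prints the command, so nothing is launched until you pass the list to `subprocess.run` yourself.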

1 Answer


I faced the same problem you describe. I waited a long time and noticed something interesting: I eventually got some evaluation results, and the training process continued after that. It seems that the evaluation simply takes a very long time, and because it gives no output at the beginning, it looks as if the process is stuck. Maybe changing the parameter 'sample_1_of_n_eval_examples' will help. I'm trying it now...
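
If you want to experiment with that parameter, here is a rough sketch of how the model_main.py command line could be built with `--sample_1_of_n_eval_examples` set, so that only one in every n evaluation examples is used. The helper name `build_model_main_command`, the example paths, and the value 10 are assumptions for illustration only; I have not measured which value gives an acceptable evaluation time.

```python
import sys


def build_model_main_command(pipeline_config_path, model_dir, num_train_steps,
                             sample_1_of_n_eval_examples=10,
                             script="object_detection/model_main.py"):
    """Return the argument list for model_main.py with eval subsampling.

    sample_1_of_n_eval_examples=10 is an arbitrary illustrative value;
    1 means every evaluation example is used.
    """
    if sample_1_of_n_eval_examples < 1:
        raise ValueError("sample_1_of_n_eval_examples must be >= 1")
    return [
        sys.executable,
        script,
        "--pipeline_config_path=" + pipeline_config_path,
        "--model_dir=" + model_dir,
        "--num_train_steps=" + str(num_train_steps),
        "--sample_1_of_n_eval_examples=" + str(sample_1_of_n_eval_examples),
        "--alsologtostderr",
    ]


if __name__ == "__main__":
    # Placeholder paths and step count, replace them with your own.
    cmd = build_model_main_command("path/to/pipeline.config",
                                   "path/to/model_dir", 50000)
    print(" ".join(cmd))
```

A larger value evaluates fewer images per checkpoint, which should shorten the silent pause, at the cost of a noisier evaluation metric.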