I'm trying to train a SSD mobilenet v2 using Tensorflow Object Detection API, with Tensorflow GPU. The training goes well and fast until the first checkpoint save (after some hundreds of steps), where it gets stuck after restoring the last checkpoint. The GPU usage goes down and never comes up. Sometimes Python itself crashes.
I'm running Tensorflow GPU on Windows 7, with an NVIDIA Quadro M4000, with CUDA 8.0 (the only version I managed to work with). The model is an SSD Mobilenet v2 pretrained with COCO, using a very low batch size of 4.
The config file is the same as it comes out from the Tensorflow Model ZOO, of course changing paths, batch size, number of classes and number of steps and adding shuffle: true on the training part.
I'm adding the terminal infos that come out. This is where it gets stuck.
Did someone experience the same kind of problem or has any idea why?
Thanks in advance