2
votes

I used mobilenet model to train my images. It worked fine. In order to increase the accuracy I tried to replicate the same steps using a faster_rcnn_resnet101_coco model instead. All the steps I used were the same. When I initiate the training session, it got started and ran about 800 steps. The training loss at this point was around 0.5 which seems too good to be true. It stopped at this step and threw the following error:

The replica worker 1 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 332, in train saver=saver) File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 763, in train sess, train_op, global_step, train_step_kwargs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step run_metadata=run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run run_metadata_ptr) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call raise type(e)(node_def, op, message) UnavailableError: Endpoint read failed To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=341450659208&resource=ml_job%2Fjob_id%2Fobject_detection_188003&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22object_detection_188003%22

Any idea what the problem could be? Any help is much appreciated.

1

1 Answers

1
votes

Thanks for feedback. We are still investigating the issue, and please use 1.2 runtime version for now.