Faster RCNN Model training stops running on GCP, runs locally without issue

Question

I am trying to run a training program based on the TensorFlow Object Detection API. The Faster RCNN model stops training on GCP but trains locally without issue. I have already tried granting the Logs Writer role to the service agent, as suggested in other posts, and have not been able to find any other suggestions. Any feedback would be appreciated.

Full Error Message:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 296, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 763, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: Endpoint read failed

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1086278442266&resource=ml_job%2Fjob_id%2Fuav_object_detection_1543356760&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22uav_object_detection_1543356760%22

This is what I am running in Terminal to start training:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
   --job-dir=gs://my_gcs_bucket/train \
   --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
   --module-name object_detection.train \
   --region us-central1 \
   --config object_detection/samples/cloud/cloud.yml \
   --runtime-version=1.4 \
   -- \
   --train_dir=gs://my_gcs_bucket/train \
   --pipeline_config_path=gs://my_gcs_bucket/data/faster_rcnn_resnet101.config

This is my file structure in the GCS bucket:

+ data/
  - faster_rcnn_resnet101.config
  - model.ckpt.index
  - model.ckpt.meta
  - model.ckpt.data-00000-of-00001
  - pet_label_map.pbtxt
  - train.record
  - val.record
+ train/

This is the file structure of the folder I am submitting the job from:

+ dist/
  - object_detection-0.1.tar.gz
+ object_detection/
+ object_detection.egg-info/
+ slim/
- setup.py

Config File:

# Faster R-CNN with Resnet-101 (v1) configured for the Oxford-IIIT Pet Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  batch_queue_capacity: 1
  num_batch_queue_threads: 1
  prefetch_queue_capacity: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://my_gcs_bucket/data/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps:2000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://my_gcs_bucket/data/data/train.record"
  }
  label_map_path: "gs://my_gcs_bucket/data/data/label_map.pbtxt"
  queue_capacity: 10
  min_after_dequeue: 5
}

eval_config: {
  num_examples: 4
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://my_gcs_bucket/data/data/val.record"
  }
  label_map_path: "gs://my_gcs_bucket/data/data/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

Are you able to run the job successfully using runtime version 1.2 or other versions? From similar failures, the issue seems to be related to gRPC communication between the various nodes. — rpasricha
Yep, using version 1.2 was the solution. Interestingly I can run 1.10 on my local machine just fine. — hoss24

hoss24 · Accepted Answer · 2018-11-29T18:37:01

Changing the runtime version to 1.2, both in cloud.yml and in the job submission command, fixed the problem.
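
For anyone applying the same fix, here is a rough sketch of what pinning the version in both places might look like. It reuses the bucket and package paths from the question. The file name cloud_rt12.yml is made up for illustration, and the YAML is deliberately minimal: a real cloud.yml would keep whatever scale tier and worker settings it already has. The command is only printed, not submitted.

```shell
#!/bin/sh
# Sketch only: use one variable so the config file and the CLI flag cannot drift apart.
RUNTIME_VERSION="1.2"

# Hypothetical minimal config file (assumption: your real cloud.yml also
# contains scaleTier / worker settings, which should be kept as they are).
cat > cloud_rt12.yml <<EOF
trainingInput:
  runtimeVersion: "${RUNTIME_VERSION}"
EOF

# Dry run: print the submit command. Remove the leading `echo` to actually submit.
echo gcloud ml-engine jobs submit training "$(whoami)_object_detection_$(date +%s)" \
    --job-dir=gs://my_gcs_bucket/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config cloud_rt12.yml \
    --runtime-version="${RUNTIME_VERSION}" \
    -- \
    --train_dir=gs://my_gcs_bucket/train \
    --pipeline_config_path=gs://my_gcs_bucket/data/faster_rcnn_resnet101.config
```

Keeping the version in a single variable is just a convenience. The point from the answer is that runtimeVersion in the YAML and --runtime-version on the command line should both say 1.2.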