
I am trying to use ml-engine to tune some hyperparameters of a custom model. The model runs fine when I run it on a single instance (e.g., standard_gpu or complex_model_m_gpu), but it fails when I try to run the same job on a cluster of GPU-enabled machines. I am following the instructions for the CUSTOM tier using a config.yaml file, as described here. Adding this config file to the submission is the only change. Is there something else I need to do to run a distributed job?
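
For reference, the config file looks roughly like this (the machine types, counts, and hyperparameter entries here are illustrative, not my exact values):

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerType: complex_model_m_gpu
  parameterServerType: large_model
  workerCount: 2
  parameterServerCount: 1
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: loss
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: learning-rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE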

I am submitting the job like this:

gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.10 \
    --python-version 3.5 \
    --module-name module.run_task \
    --package-path module/ \
    --region $REGION \
    --config hptuning_config.yaml \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA

My setup.py file requires tensorflow-probability 0.3.0 (the model breaks if I use 0.4.0).
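
The pin in setup.py looks roughly like this (the name and version of my own package are placeholders):

from setuptools import find_packages, setup

setup(
    name='module',  # placeholder package name
    version='0.1',
    packages=find_packages(),
    install_requires=['tensorflow-probability==0.3.0'],
)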

The error (seen on all workers) is pasted below. Any help appreciated!

worker-replica-0 Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/module/run_task.py", line 74, in <module>
    train_and_evaluate(hparams)
  File "/root/.local/lib/python3.5/site-packages/module/run_task.py", line 42, in train_and_evaluate
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 617, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 627, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 747, in _start_distributed_training
    self._start_std_server(config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 735, in _start_std_server
    start=False)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server

Are there any other errors, for example on the parameter server or master? – rhaertel80

Yes, the same error occurs on all workers and on the master (but not on the parameter servers). Also, I just noticed that the errors all occur on their respective machines after one round of training completes (max train steps reached) and "Start Tensorflow server" appears in the logs. It looks like the workers die first, when they try to run tf.estimator.train_and_evaluate for the second time (my workflow involves two sequential training steps), while the master is still finishing the first round of training. – bcleary

1 Answer


This error occurs because you are calling tf.estimator.train_and_evaluate twice in succession. When the second call is made, not all of the gRPC servers started by the first call have shut down yet, so the second call attempts to open new servers on ports that are still in use.

Running multiple distributed jobs in succession within a single process is not supported in TensorFlow; the parameter servers in particular block until the process is killed. You'll need to refactor your code so that each job makes exactly one call to tf.estimator.train_and_evaluate.
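
A rough sketch of that pattern (the model_fn, input_fn, and step counts below are trivial placeholders, not your model):

import tensorflow as tf

def model_fn(features, labels, mode):
    # Placeholder linear model standing in for the real one.
    logits = tf.layers.dense(features['x'], 1)
    loss = tf.losses.mean_squared_error(labels, logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

def input_fn():
    # Placeholder random data in place of the real input pipeline.
    features = {'x': tf.random_normal([32, 4])}
    labels = tf.random_normal([32, 1])
    return tf.data.Dataset.from_tensors((features, labels)).repeat()

def main():
    estimator = tf.estimator.Estimator(model_fn=model_fn)
    # Express the whole training budget in one TrainSpec so that
    # train_and_evaluate runs exactly once; the gRPC servers it starts
    # are only released when the process exits.
    train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000)
    eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

if __name__ == '__main__':
    main()

If you genuinely need two sequential training phases, one option is to submit them as two separate ml-engine jobs and point the second job's --job-dir at the first job's output, so the Estimator restores from the checkpoints the first phase left behind.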