I am trying to use Cloud ML Engine to tune some hyperparameters of a custom model. The model runs fine when I run it on a single instance (e.g., standard_gpu or complex_model_m_gpu), but it fails when I try to run the same job on a cluster of GPU-enabled machines. I am following the instructions for the CUSTOM scale tier using a config.yaml file, as described here. Adding this config file to the submission is the only change. Is there something else I need to do to run a distributed job?
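The config file follows the CUSTOM-tier pattern from the docs. A minimal sketch of the kind of thing it contains (the machine types, counts, and the hyperparameter entry below are placeholders for illustration, not my exact values):

# hptuning_config.yaml (sketch) -- CUSTOM tier with a worker/parameter-server
# cluster plus a hyperparameter tuning block; values are illustrative only.
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: standard
  workerCount: 2
  parameterServerCount: 1
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: learning-rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE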
I am submitting the job like this:
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.10 \
--python-version 3.5 \
--module-name module.run_task \
--package-path module/ \
--region $REGION \
--config hptuning_config.yaml \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
My setup.py file requires tensorflow-probability 0.3.0 (the model breaks if I use 0.4.0).
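For completeness, the relevant part of setup.py looks roughly like this (the package name and version are placeholders; the only detail that matters is the tensorflow-probability pin):

# setup.py (sketch) -- pins tensorflow-probability to 0.3.0 for the
# trainer package that gets staged with the job.
from setuptools import find_packages, setup

setup(
    name='module',
    version='0.1',
    packages=find_packages(),
    include_package_data=True,
    install_requires=['tensorflow-probability==0.3.0'],
)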
The error (seen on all workers) is pasted below. Any help appreciated!
worker-replica-0 Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/module/run_task.py", line 74, in <module>
    train_and_evaluate(hparams)
  File "/root/.local/lib/python3.5/site-packages/module/run_task.py", line 42, in train_and_evaluate
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 617, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 627, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 747, in _start_distributed_training
    self._start_std_server(config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 735, in _start_std_server
    start=False)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server