An error occurs when running, from Cloud Shell, the sample code that Google's @SlavenBilac posted for training and classifying images with Google Cloud Machine Learning and Cloud Dataflow.
The training job gets stuck at global_step/sec: 0
INFO 2017-02-16 06:28:36 -0600 master-replica-0 Start master session 538be2b71d17c4dc with config:
ERROR 2017-02-16 06:28:36 -0600 master-replica-0 device_filters: "/job:ps"
ERROR 2017-02-16 06:28:36 -0600 master-replica-0 device_filters: "/job:master/task:0"
INFO 2017-02-16 06:28:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:30:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:32:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:34:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:36:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:38:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:40:39 -0600 master-replica-0 global_step/sec: 0
<keeps repeating until I kill the job>
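For reference, this is how I inspect and then kill the stuck job (assuming the same gcloud beta ml command group that sample.sh uses below; subcommand names may differ in other gcloud versions):

```shell
# JOB_ID as produced by sample.sh for this run (substitute your own).
JOB_ID="flowers_cfinley3_20170216_045347"

# Show the job's current state and any error message:
gcloud beta ml jobs describe "$JOB_ID" || echo "describe failed (is gcloud configured?)"

# Kill the stuck job so it stops consuming resources:
gcloud beta ml jobs cancel "$JOB_ID" || echo "cancel failed"
```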
Based on Google's @JoshGC answer to a similar question, I created an entirely new Google Cloud account (with a new billing account, new project, etc.) yesterday, ran the Cloud Shell setup script and the other environment-setup steps, and then ran the sample code against the sample flower data. The same error occurs (as shown below), so I don't think the cause is related to the data or to my account configuration.
How can one modify the file(s) from GoogleCloudPlatform/cloudml-samples/flowers to avoid this error?
Excerpts:
Run sample code
cfinley3@wordthree-wordfour-7654321:~/google-cloud-ml/samples/flowers$ ./sample.sh
Your active configuration is: [cloudshell-18758]
Using job id: flowers_cfinley3_20170216_045347
Preprocessing seems OK
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" \
--output_path "${GCS_PATH}/preprocess/train" \
--cloud
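Since preprocessing runs as a Cloud Dataflow job, before blaming training I also checked that the Dataflow job actually finished (command group assumed; it may sit under beta in older gcloud versions):

```shell
# List recent Dataflow jobs and their states (Done / Running / Failed):
gcloud dataflow jobs list --limit=5 || echo "listing failed (is gcloud configured?)"
```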
Training starts
gcloud beta ml jobs submit training "$JOB_ID" \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
Job [flowers_cfinley3_20170216_045347] submitted successfully.
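One thing I noticed while staring at the excerpts: the preprocess step writes its output under ${GCS_PATH}/preprocess/, while the training arguments read from ${GCS_PATH}/preproc/*. I'm not sure whether that difference is a typo in my excerpt or real, but it's easy to check whether the training job can see any input files at all (the GCS_PATH value here is the one from my run; substitute your own):

```shell
# GCS_PATH as set by sample.sh for this run (assumed; substitute your own).
GCS_PATH="gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347"

# The preprocess step wrote to .../preprocess/, training reads .../preproc/*:
for prefix in preprocess preproc; do
  echo "--- ${GCS_PATH}/${prefix}/train*"
  gsutil ls "${GCS_PATH}/${prefix}/train*" || echo "    (no files found)"
done
```

If the globs that training receives match nothing, the input queues would have nothing to read, which could explain a global_step that never advances.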
Training gets stuck at global_step/sec: 0
INFO 2017-02-16 06:24:48 -0600 unknown_task Validating job requirements...
INFO 2017-02-16 06:24:48 -0600 unknown_task Job creation request has been successfully validated.
INFO 2017-02-16 06:24:48 -0600 unknown_task Job flowers_cfinley3_20170216_045347 is queued.
INFO 2017-02-16 06:24:55 -0600 unknown_task Waiting for job to be provisioned.
INFO 2017-02-16 06:24:55 -0600 unknown_task Waiting for TensorFlow to start.
INFO 2017-02-16 06:28:27 -0600 master-replica-0 Running task with arguments: --cluster={"master": ["master-9a431abe8e-0:2222"]} --task={"type": "master", "index": 0} --job={
INFO 2017-02-16 06:28:27 -0600 master-replica-0 "package_uris": ["gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz"],
INFO 2017-02-16 06:28:27 -0600 master-replica-0 "python_module": "trainer.task",
INFO 2017-02-16 06:28:27 -0600 master-replica-0 "args": ["--output_path", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training", "--eval_data_paths", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval*", "--train_data_paths", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*"],
INFO 2017-02-16 06:28:27 -0600 master-replica-0 "region": "us-central1"
INFO 2017-02-16 06:28:27 -0600 master-replica-0 } --beta
INFO 2017-02-16 06:28:28 -0600 master-replica-0 Running module trainer.task.
INFO 2017-02-16 06:28:28 -0600 master-replica-0 Running command: gsutil -q cp gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz trainer-0.1.tar.gz
INFO 2017-02-16 06:28:29 -0600 master-replica-0 Installing the package: gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz
INFO 2017-02-16 06:28:29 -0600 master-replica-0 Running command: pip install --user --upgrade --force-reinstall trainer-0.1.tar.gz
INFO 2017-02-16 06:28:29 -0600 master-replica-0 Processing ./trainer-0.1.tar.gz
INFO 2017-02-16 06:28:30 -0600 master-replica-0 Building wheels for collected packages: trainer
INFO 2017-02-16 06:28:30 -0600 master-replica-0 Running setup.py bdist_wheel for trainer: started
INFO 2017-02-16 06:28:30 -0600 master-replica-0 creating '/tmp/tmpn9HeiIpip-wheel-/trainer-0.1-cp27-none-any.whl' and adding '.' to it
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer/model.py'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer/__init__.py'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer/util.py'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer/preprocess.py'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer-0.1.dist-info/DESCRIPTION.rst'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer-0.1.dist-info/metadata.json'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer-0.1.dist-info/top_level.txt'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer-0.1.dist-info/METADATA'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 adding 'trainer-0.1.dist-info/RECORD'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 Running setup.py bdist_wheel for trainer: finished with status 'done'
INFO 2017-02-16 06:28:30 -0600 master-replica-0 Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e
INFO 2017-02-16 06:28:30 -0600 master-replica-0 Successfully built trainer
INFO 2017-02-16 06:28:30 -0600 master-replica-0 Installing collected packages: trainer
INFO 2017-02-16 06:28:30 -0600 master-replica-0 Successfully installed trainer-0.1
INFO 2017-02-16 06:28:31 -0600 master-replica-0 Running command: python -m trainer.task --output_path gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training --eval_data_paths gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval* --train_data_paths gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*
INFO 2017-02-16 06:28:34 -0600 master-replica-0 Original job data: {u'package_uris': [u'gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz'], u'args': [u'--output_path', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training', u'--eval_data_paths', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval*', u'--train_data_paths', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*'], u'python_module': u'trainer.task', u'region': u'us-central1'}
INFO 2017-02-16 06:28:34 -0600 master-replica-0 setting eval batch size to 100
INFO 2017-02-16 06:28:34 -0600 master-replica-0 Starting master/0
INFO 2017-02-16 06:28:34 -0600 master-replica-0 Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
INFO 2017-02-16 06:28:34 -0600 master-replica-0 Started server with target: grpc://localhost:2222
WARNING 2017-02-16 06:28:35 -0600 master-replica-0 From /root/.local/lib/python2.7/site-packages/trainer/task.py:211 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
WARNING 2017-02-16 06:28:35 -0600 master-replica-0 Instructions for updating:
WARNING 2017-02-16 06:28:35 -0600 master-replica-0 Please switch to tf.summary.merge_all.
WARNING 2017-02-16 06:28:35 -0600 master-replica-0 From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py:270 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
WARNING 2017-02-16 06:28:35 -0600 master-replica-0 Instructions for updating:
WARNING 2017-02-16 06:28:35 -0600 master-replica-0 Please switch to tf.summary.merge.
INFO 2017-02-16 06:28:36 -0600 master-replica-0 Start master session 538be2b71d17c4dc with config:
ERROR 2017-02-16 06:28:36 -0600 master-replica-0 device_filters: "/job:ps"
ERROR 2017-02-16 06:28:36 -0600 master-replica-0 device_filters: "/job:master/task:0"
INFO 2017-02-16 06:28:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:30:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:32:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:34:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:36:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:38:39 -0600 master-replica-0 global_step/sec: 0
INFO 2017-02-16 06:40:39 -0600 master-replica-0 global_step/sec: 0