An error occurs when running, in Cloud Shell, the sample code that Google's @SlavenBilac posted for training and classifying images with Google Cloud Machine Learning and Cloud Dataflow.

The job gets stuck at global_step/sec: 0:

INFO    2017-02-16 06:28:36 -0600       master-replica-0                Start master session 538be2b71d17c4dc with config: 
ERROR   2017-02-16 06:28:36 -0600       master-replica-0                device_filters: "/job:ps"
ERROR   2017-02-16 06:28:36 -0600       master-replica-0                device_filters: "/job:master/task:0"
INFO    2017-02-16 06:28:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:30:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:32:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:34:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:36:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:38:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:40:39 -0600       master-replica-0                global_step/sec: 0
<keeps repeating until I kill the job>
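To stop the hung job cleanly instead of leaving it running, it can be cancelled from Cloud Shell (a minimal sketch, assuming the same gcloud beta ml command group used to submit the job below):

# Cancel the hung training job; $JOB_ID is the id printed at submission
gcloud beta ml jobs cancel "$JOB_ID"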

Based on Google's @JoshGC answer to a similar question, I created an entirely new Google Cloud account (with a new billing account, a new project, etc.) yesterday, ran the Cloud Shell setup script and the other environment-setup steps, and then ran the sample code against the sample flower data. The same error occurs (as shown below), so I don't think the cause is related to the data or to my account configuration.

How can one modify the file(s) from GoogleCloudPlatform/cloudml-samples/flowers to avoid this error?

Excerpts:

Run the sample code

cfinley3@wordthree-wordfour-7654321:~/google-cloud-ml/samples/flowers$ ./sample.sh

Your active configuration is: [cloudshell-18758]
Using job id:  flowers_cfinley3_20170216_045347

Preprocessing seems OK:

python trainer/preprocess.py \
  --input_dict "$DICT_FILE" \
  --input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" \
  --output_path "${GCS_PATH}/preprocess/train" \
  --cloud
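To confirm that preprocessing actually produced files, the output prefix can be listed before training starts (a quick sketch, using the same GCS_PATH variable as the commands above):

# List the preprocessed training files and their sizes;
# an empty listing here means the training job will have no input to read
gsutil ls -l "${GCS_PATH}/preprocess/train*"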

Training starts

gcloud beta ml jobs submit training "$JOB_ID" \
  --module-name trainer.task \
  --package-path trainer \
  --staging-bucket "$BUCKET" \
  --region us-central1 \
  -- \
  --output_path "${GCS_PATH}/training" \
  --eval_data_paths "${GCS_PATH}/preproc/eval*" \
  --train_data_paths "${GCS_PATH}/preproc/train*"
Job [flowers_cfinley3_20170216_045347] submitted successfully.
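While the job runs, its state and logs can be followed from Cloud Shell (a sketch using the same gcloud beta ml command group; describe and stream-logs are assumed to be available in this gcloud release):

# Show the job's current state (QUEUED, RUNNING, SUCCEEDED, ...)
gcloud beta ml jobs describe "$JOB_ID"

# Tail the job's logs as they arrive
gcloud beta ml jobs stream-logs "$JOB_ID"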

Training gets stuck at global_step/sec: 0:

INFO    2017-02-16 06:24:48 -0600       unknown_task            Validating job requirements...
INFO    2017-02-16 06:24:48 -0600       unknown_task            Job creation request has been successfully validated.
INFO    2017-02-16 06:24:48 -0600       unknown_task            Job flowers_cfinley3_20170216_045347 is queued.
INFO    2017-02-16 06:24:55 -0600       unknown_task            Waiting for job to be provisioned.
INFO    2017-02-16 06:24:55 -0600       unknown_task            Waiting for TensorFlow to start.
INFO    2017-02-16 06:28:27 -0600       master-replica-0                Running task with arguments: --cluster={"master": ["master-9a431abe8e-0:2222"]} --task={"type": "master", "index": 0} --job={
INFO    2017-02-16 06:28:27 -0600       master-replica-0                  "package_uris": ["gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz"],
INFO    2017-02-16 06:28:27 -0600       master-replica-0                  "python_module": "trainer.task",
INFO    2017-02-16 06:28:27 -0600       master-replica-0                  "args": ["--output_path", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training", "--eval_data_paths", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval*", "--train_data_paths", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*"],
INFO    2017-02-16 06:28:27 -0600       master-replica-0                  "region": "us-central1"
INFO    2017-02-16 06:28:27 -0600       master-replica-0                } --beta
INFO    2017-02-16 06:28:28 -0600       master-replica-0                Running module trainer.task.
INFO    2017-02-16 06:28:28 -0600       master-replica-0                Running command: gsutil -q cp gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz trainer-0.1.tar.gz
INFO    2017-02-16 06:28:29 -0600       master-replica-0                Installing the package: gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz
INFO    2017-02-16 06:28:29 -0600       master-replica-0                Running command: pip install --user --upgrade --force-reinstall trainer-0.1.tar.gz
INFO    2017-02-16 06:28:29 -0600       master-replica-0                Processing ./trainer-0.1.tar.gz
INFO    2017-02-16 06:28:30 -0600       master-replica-0                Building wheels for collected packages: trainer
INFO    2017-02-16 06:28:30 -0600       master-replica-0                  Running setup.py bdist_wheel for trainer: started
INFO    2017-02-16 06:28:30 -0600       master-replica-0                creating '/tmp/tmpn9HeiIpip-wheel-/trainer-0.1-cp27-none-any.whl' and adding '.' to it
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer/model.py'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer/__init__.py'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer/util.py'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer/preprocess.py'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer-0.1.dist-info/DESCRIPTION.rst'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer-0.1.dist-info/metadata.json'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer-0.1.dist-info/top_level.txt'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer-0.1.dist-info/METADATA'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                adding 'trainer-0.1.dist-info/RECORD'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                  Running setup.py bdist_wheel for trainer: finished with status 'done'
INFO    2017-02-16 06:28:30 -0600       master-replica-0                  Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e
INFO    2017-02-16 06:28:30 -0600       master-replica-0                Successfully built trainer
INFO    2017-02-16 06:28:30 -0600       master-replica-0                Installing collected packages: trainer
INFO    2017-02-16 06:28:30 -0600       master-replica-0                Successfully installed trainer-0.1
INFO    2017-02-16 06:28:31 -0600       master-replica-0                Running command: python -m trainer.task --output_path gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training --eval_data_paths gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval* --train_data_paths gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*
INFO    2017-02-16 06:28:34 -0600       master-replica-0                Original job data: {u'package_uris': [u'gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz'], u'args': [u'--output_path', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training', u'--eval_data_paths', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval*', u'--train_data_paths', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*'], u'python_module': u'trainer.task', u'region': u'us-central1'}
INFO    2017-02-16 06:28:34 -0600       master-replica-0                setting eval batch size to 100
INFO    2017-02-16 06:28:34 -0600       master-replica-0                Starting master/0
INFO    2017-02-16 06:28:34 -0600       master-replica-0                Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
INFO    2017-02-16 06:28:34 -0600       master-replica-0                Started server with target: grpc://localhost:2222
WARNING 2017-02-16 06:28:35 -0600       master-replica-0                From /root/.local/lib/python2.7/site-packages/trainer/task.py:211 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
WARNING 2017-02-16 06:28:35 -0600       master-replica-0                Instructions for updating:
WARNING 2017-02-16 06:28:35 -0600       master-replica-0                Please switch to tf.summary.merge_all.
WARNING 2017-02-16 06:28:35 -0600       master-replica-0                From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py:270 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
WARNING 2017-02-16 06:28:35 -0600       master-replica-0                Instructions for updating:
WARNING 2017-02-16 06:28:35 -0600       master-replica-0                Please switch to tf.summary.merge.
INFO    2017-02-16 06:28:36 -0600       master-replica-0                Start master session 538be2b71d17c4dc with config: 
ERROR   2017-02-16 06:28:36 -0600       master-replica-0                device_filters: "/job:ps"
ERROR   2017-02-16 06:28:36 -0600       master-replica-0                device_filters: "/job:master/task:0"
INFO    2017-02-16 06:28:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:30:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:32:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:34:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:36:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:38:39 -0600       master-replica-0                global_step/sec: 0
INFO    2017-02-16 06:40:39 -0600       master-replica-0                global_step/sec: 0

1 Answer

See this similar question. Check that your input data files exist and are not empty; if the training job cannot find any data, TensorFlow blocks waiting for input forever and global_step never advances. Note that in the excerpts above, preprocessing writes to ${GCS_PATH}/preprocess/... while training reads ${GCS_PATH}/preproc/eval* and ${GCS_PATH}/preproc/train*, so the training job may be globbing a prefix that contains no files.
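A quick way to check is to list exactly the paths the training job globs; if nothing matches, or the matched files are 0 bytes, the job sits at global_step/sec: 0 indefinitely (a sketch; substitute the GCS_PATH used at submission):

# List the files the training job reads, with their sizes
gsutil ls -l "${GCS_PATH}/preproc/eval*"
gsutil ls -l "${GCS_PATH}/preproc/train*"

# If preprocessing wrote under a different prefix (the excerpts show
# preprocess/ for preprocessing but preproc/ for training), the globs
# above match nothing; list the whole prefix to see where the files landed
gsutil ls -r "${GCS_PATH}/"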