I am following the flowers tutorial for retraining Inception on Google Cloud ML. I can run the tutorial, train, and predict just fine.
I then replaced the flowers dataset with a test dataset of my own: optical character recognition of digits in images.
My full code is here
Dict File for labels
Eval set
Training Set
I am running from a recent Docker build provided by Google:
`docker run -it -p "127.0.0.1:8080:8080" --entrypoint=/bin/bash gcr.io/cloud-datalab/datalab:local-20161227`
I can preprocess the files and submit the training job using:
# Submit training job.
gcloud beta ml jobs submit training "$JOB_ID" \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
but it never makes it past global step 0. The flowers tutorial trained in about 1 hour on the free tier. I have let my training run for as long as 11 hours, with no movement.
Looking at Stackdriver, nothing progresses.
I have also tried a tiny toy dataset of 20 training images, and 10 eval images. Same issue.
The GCS Bucket ends up looking like this
Perhaps unsurprisingly, I can't visualize this log in TensorBoard; there is nothing to show.
Full training log:
INFO 2017-01-10 17:22:00 +0000 unknown_task Validating job requirements...
INFO 2017-01-10 17:22:01 +0000 unknown_task Job creation request has been successfully validated.
INFO 2017-01-10 17:22:01 +0000 unknown_task Job MeerkatReader_MeerkatReader_20170110_170701 is queued.
INFO 2017-01-10 17:22:07 +0000 unknown_task Waiting for job to be provisioned.
INFO 2017-01-10 17:22:07 +0000 unknown_task Waiting for TensorFlow to start.
INFO 2017-01-10 17:22:10 +0000 master-replica-0 Running task with arguments: --cluster={"master": ["master-d4f6-0:2222"]} --task={"type": "master", "index": 0} --job={
INFO 2017-01-10 17:22:10 +0000 master-replica-0 "package_uris": ["gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz"],
INFO 2017-01-10 17:22:10 +0000 master-replica-0 "python_module": "trainer.task",
INFO 2017-01-10 17:22:10 +0000 master-replica-0 "args": ["--output_path", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/training", "--eval_data_paths", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/eval*", "--train_data_paths", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/train*"],
INFO 2017-01-10 17:22:10 +0000 master-replica-0 "region": "us-central1"
INFO 2017-01-10 17:22:10 +0000 master-replica-0 } --beta
INFO 2017-01-10 17:22:10 +0000 master-replica-0 Downloading the package: gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz
INFO 2017-01-10 17:22:10 +0000 master-replica-0 Running command: gsutil -q cp gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz trainer-0.1.tar.gz
INFO 2017-01-10 17:22:12 +0000 master-replica-0 Building wheels for collected packages: trainer
INFO 2017-01-10 17:22:12 +0000 master-replica-0 creating '/tmp/tmpSgdSzOpip-wheel-/trainer-0.1-cp27-none-any.whl' and adding '.' to it
INFO 2017-01-10 17:22:12 +0000 master-replica-0 adding 'trainer/model.py'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 adding 'trainer/util.py'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 adding 'trainer/preprocess.py'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 adding 'trainer/task.py'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 adding 'trainer-0.1.dist-info/metadata.json'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 adding 'trainer-0.1.dist-info/WHEEL'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 adding 'trainer-0.1.dist-info/METADATA'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 Running setup.py bdist_wheel for trainer: finished with status 'done'
INFO 2017-01-10 17:22:12 +0000 master-replica-0 Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e
INFO 2017-01-10 17:22:12 +0000 master-replica-0 Successfully built trainer
INFO 2017-01-10 17:22:13 +0000 master-replica-0 Running command: python -m trainer.task --output_path gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/training --eval_data_paths gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/eval* --train_data_paths gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/train*
INFO 2017-01-10 17:22:14 +0000 master-replica-0 Starting master/0
INFO 2017-01-10 17:22:14 +0000 master-replica-0 Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
INFO 2017-01-10 17:22:14 +0000 master-replica-0 Started server with target: grpc://localhost:2222
ERROR 2017-01-10 17:22:16 +0000 master-replica-0 device_filters: "/job:ps"
INFO 2017-01-10 17:22:19 +0000 master-replica-0 global_step/sec: 0
The last line just repeats until I kill the job.
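To confirm that the reported rate never leaves zero over a long run, the log can be scanned for its `global_step/sec` entries. This is only a minimal sketch, assuming the log has been saved as plain text in the same format as the excerpt above; the names `step_rates` and `is_stalled` are hypothetical helpers and not part of any Cloud ML or TensorFlow API.

```python
import re

# Matches entries such as:
# INFO 2017-01-10 17:22:19 +0000 master-replica-0 global_step/sec: 0
RATE_RE = re.compile(r"global_step/sec:\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def step_rates(log_lines):
    """Return every global_step/sec value found in the log, in order."""
    rates = []
    for line in log_lines:
        match = RATE_RE.search(line)
        if match:
            rates.append(float(match.group(1)))
    return rates


def is_stalled(log_lines):
    """True when the log reports at least one rate and all of them are zero."""
    rates = step_rates(log_lines)
    return bool(rates) and all(rate == 0.0 for rate in rates)


# Two lines copied from the training log above.
sample = [
    "INFO 2017-01-10 17:22:14 +0000 master-replica-0 Starting master/0",
    "INFO 2017-01-10 17:22:19 +0000 master-replica-0 global_step/sec: 0",
]
print(is_stalled(sample))
```

A result of `True` only says that no step was ever completed; it does not say why the job is stuck.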
Is my mental model for this service incorrect? All suggestions welcome.