I am trying to submit a Google Cloud job that trains a CNN model for MNIST digit classification. Since I am new to GCP, I want to practice by training this job on an f1-micro machine first, but I haven't been successful. I have run into two issues along the way.
Here is my setup: Windows 10, Anaconda, Jupyter Notebook 6, Python 3.6, TensorFlow 1.13.0. At first my model trains fine without any gcloud command involved. Then I packaged the files into a module, as the GCP course suggests, and used the gcloud command for local training. The cell seems stuck and does nothing until I close and halt the .ipynb file; the training starts right after that, and the results are correct as monitored on TensorBoard. What do I need to do to make it run normally from the cell, without closing the notebook? (I can run the same command from a terminal without this issue.)
Second issue: I then tried to submit a job to a Google Cloud machine. I created a VM instance with f1-micro just to practice, since it comes with a lot of free hours, but my command options aren't accepted. I tried a couple of formats for the machine type and can't get it right. Also, how do I build the connection to the instance I have created?
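(For reference, I assume reaching the instance itself from a terminal would be `gcloud compute ssh`, with my instance name and zone as placeholders below, but I'm not sure whether the training job even uses that VM:)

```shell
# Placeholder instance name and zone; prefix with ! to run from a notebook cell
gcloud compute ssh my-f1-micro-instance --zone=us-central1-a
```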
Any advice? Thanks! The code is below.
# 1. Local submission lines
import shutil

OUTDIR = 'trained_test'
INPDIR = r'..\data'   # raw string keeps the Windows backslash literal
shutil.rmtree(path=OUTDIR, ignore_errors=True)   # clear any previous output
!gcloud ai-platform local train \
--module-name=trainer.task \
--package-path=trainer \
-- \
--output_dir=$OUTDIR \
--input_dir=$INPDIR \
--epochs=2 \
--learning_rate=0.001 \
--batch_size=100
# 2. Submit the training job to Google Cloud (AI Platform)
OUTDIR = 'gs://' + BUCKET + '/digit/train_01'
INPDIR = 'gs://' + BUCKET + '/digit/data'
JOBNAME = 'kaggle_digit_01_' + datetime.now().strftime("%Y%m%d_%H%M%S")
!gcloud ai-platform jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--scale-tier=custom \
--master-machine-type=zones/us-central1-a/machineTypes/f1-micro \
--runtime-version 1.13 \
-- \
--output_dir=$OUTDIR \
--input_dir=$INPDIR \
--epochs=5 --learning_rate=0.001 --batch_size=100
Error message:
ERROR: (gcloud.ai-platform.jobs.submit.training) INVALID_ARGUMENT: Field: master_type Error: The specified machine type is not supported: zones/us-central1-a/machineTypes/f1-micro
- '@type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:
- description: 'The specified machine type is not supported: zones/us-central1-a/machineTypes/f1-micro'
field: master_type
Update:
Changing the machine type made the submission work:
--scale-tier=CUSTOM \
--master-machine-type=n1-standard-4 \
I also put the following at the beginning, so that the notebook recognizes paths such as $jobname:
import gcsfs
By the way, --job-dir doesn't seem to matter.
However, local training still has the same issue: I need to close and halt the notebook to kick off the training. Could anyone give a suggestion on this?