
I am trying to submit a Google Cloud job that trains a CNN model on the MNIST digits dataset. Since I am new to GCP, I wanted to practice by training this job on f1-micro machines first, but I have not been successful. I have run into two issues along the way.

Here's my setup: Windows 10, Anaconda, Jupyter Notebook 6, Python 3.6, TensorFlow 1.13.0. At first my model trains fine without any gcloud commands involved. I then packaged the files into a module as the GCP course suggested and used the gcloud command for local training. The cell seems to hang and do nothing until I close and halt the .ipynb file; the training starts right after that, and the results are correct as monitored on TensorBoard. What do I need to do to make it run normally from the cell without closing the notebook? For what it's worth, I can run the same command from a terminal without this issue.

Second issue: I then tried to submit a training job to a Google Cloud machine. I created a VM instance with f1-micro just to practice, since it has a lot of free hours, but my command options aren't accepted. I tried a couple of formats for the machine type and can't get it right. Also, how do I build the connection to the instance I have created?

Any advice? Thanks! The code is below.

# 1. Local training command


import shutil

OUTDIR = 'trained_test'
INPDIR = r'..\data'
shutil.rmtree(path=OUTDIR, ignore_errors=True)  # clear any previous output

!gcloud ai-platform local train \
    --module-name=trainer.task \
    --package-path=trainer \
    -- \
    --output_dir=$OUTDIR \
    --input_dir=$INPDIR \
    --epochs=2 \
    --learning_rate=0.001 \
    --batch_size=100


# 2. Submit to a Google Cloud machine

from datetime import datetime

OUTDIR = 'gs://' + BUCKET + '/digit/train_01'
INPDIR = 'gs://' + BUCKET + '/digit/data'
JOBNAME = 'kaggle_digit_01_' + datetime.now().strftime("%Y%m%d_%H%M%S")

!gcloud ai-platform jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=custom \
    --master-machine-type=zones/us-central1-a/machineTypes/f1-micro \
    --runtime-version=1.13 \
    -- \
    --output_dir=$OUTDIR \
    --input_dir=$INPDIR \
    --epochs=5 --learning_rate=0.001 --batch_size=100

Error message:

ERROR: (gcloud.ai-platform.jobs.submit.training) INVALID_ARGUMENT: Field: master_type Error: The specified machine type is not supported: zones/us-central1-a/machineTypes/f1-micro
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: 'The specified machine type is not supported: zones/us-central1-a/machineTypes/f1-micro'
    field: master_type

Update:

Changing the machine type does work:

--scale-tier=CUSTOM \
--master-machine-type=n1-standard-4 \

I also put the following at the beginning, so the notebook recognizes file paths such as $JOBNAME:

import gcsfs

By the way, --job-dir doesn't seem to matter.

However, local training still has the same issue: I need to close and halt the notebook to kick off the training. Could anyone give a suggestion on this?
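As a possible workaround (untested on my exact setup), launching the command through subprocess.Popen instead of `!` should let the cell return immediately while the training runs in the background, with its output going to a log file. A minimal sketch, with a harmless stand-in command in place of the real gcloud call:

```python
import subprocess
import sys

def launch_background(cmd, log_path):
    """Start a command without blocking; stream its output to a log file."""
    log = open(log_path, "w")
    return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)

# For the real run this would be the gcloud command, e.g.:
# proc = launch_background(
#     ["gcloud", "ai-platform", "local", "train",
#      "--module-name=trainer.task", "--package-path=trainer",
#      "--", "--output_dir=trained_test", "--epochs=2"],
#     "local_train.log")

# Harmless stand-in to demonstrate the pattern:
proc = launch_background([sys.executable, "-c", "print('training...')"],
                         "demo.log")
proc.wait()  # in a notebook you would skip this and check proc.poll() instead
```

The cell finishes as soon as Popen returns, and progress can be followed by tailing the log file or on TensorBoard.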


1 Answer


f1-micro is not supported by AI Platform Training. Here is the list of supported machines. Also, you don't need to specify a zone, just the machine type, i.e. --master-machine-type=n1-standard-4. Note that AI Platform Training provisions its own machines for the job; you don't submit it to a Compute Engine VM you created yourself, so no connection to your f1-micro instance is needed.
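For example, the submit command from the question would become something like this (a sketch, assuming REGION, BUCKET, JOBNAME, OUTDIR, and INPDIR are set as in the question):

```shell
gcloud ai-platform jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=trainer \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=CUSTOM \
    --master-machine-type=n1-standard-4 \
    --runtime-version=1.13 \
    -- \
    --output_dir=$OUTDIR \
    --input_dir=$INPDIR \
    --epochs=5 --learning_rate=0.001 --batch_size=100
```

The flags before the bare `--` go to gcloud; everything after it is passed through to your trainer.task module unchanged.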