I submitted a training job to Cloud ML Engine but it failed with an out-of-memory error. How can I specify more memory for my job?

1 Answer

If you don't specify --scale-tier in your Cloud ML Engine job, you are using BASIC, which is a single-CPU machine with 4 GB of memory.

To use an 8-CPU machine that has 52 GB of memory:

(1) Create a file named largemachine.yaml with this content:

trainingInput:
  scaleTier: CUSTOM
  masterType: large_model

(2) Add this to your ml-engine job submission:

gcloud ml-engine jobs submit training $JOB_NAME \
  ...
  --scale-tier=CUSTOM \
  --config=largemachine.yaml \
  -- \
  ...

See this page for other machine types (including GPU types) you can use: https://cloud.google.com/ml-engine/docs/tensorflow/machine-types#compare-machine-types
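
As a rough sketch of how the same pattern extends to a GPU machine, the config below swaps in standard_gpu, one of the machine types listed on that page. The file name gpumachine.yaml is only an illustrative choice and is not part of the original answer; GPU types are not offered in every region, so check the linked page before relying on it.

```yaml
# gpumachine.yaml (hypothetical file name, used here for illustration)
trainingInput:
  scaleTier: CUSTOM          # CUSTOM is required to choose masterType yourself
  masterType: standard_gpu   # assumed choice: a single-GPU machine type
```

Pass it to the job the same way as above, with --config=gpumachine.yaml in place of --config=largemachine.yaml.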