I've been trying the TensorFlow tutorial scripts on Google Cloud ML. In particular, I've used the cifar10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.

When I run this training script in Google Cloud ML, there is a memory leak of around 0.5% per hour.

I have not made any changes to the scripts other than packaging them into the required GCP format (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.
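For reference, the package can be smoke-tested locally before submitting it, which is how I checked that the packaging itself works. This is a rough sketch, assuming the trainer package runs as a module the same way Cloud ML invokes it, that the cifar10 scripts' own --data_dir flag is used to point at the data, and that the local data path and single GPU are illustrative:

# Run the packaged trainer locally via gcloud to validate the package layout
# (data path and --num_gpus value are illustrative).
gcloud ml-engine local train \
  --package-path trainer \
  --module-name trainer.cifar10_multi_gpu_train \
  -- \
  --data_dir=/tmp/cifar10_data \
  --num_gpus=1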

If I run locally (i.e. not in Google Cloud) and use tcmalloc by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved. However, I do not have this option with Google Cloud ML.
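Concretely, the local run with tcmalloc preloaded looks roughly like this (a sketch; the library path is the one above and may differ by distribution, and running the trainer as a module mirrors how Cloud ML invokes it):

# Preload tcmalloc so the trainer's Python process uses it for allocations
LD_PRELOAD="/usr/lib/libtcmalloc.so" python -m trainer.cifar10_multi_gpu_train --num_gpus=1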

What could be causing the leak, and what can I do to fix it? Why aren't other users noticing the same problem? Although the leak is small, it is big enough to cause my training sessions to run out of memory and fail when I run against my own data for several days. The leak happens regardless of the number of GPUs I use.
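One way to confirm the growth rate is to sample the training process's resident memory while it runs locally, e.g. with a small loop like the following (a rough sketch; matching the process by the name python is an assumption and may pick up other Python processes):

# Append the training process's resident memory (KB) and elapsed time
# to a log every 10 minutes, to measure the growth rate.
while true; do
  ps -C python -o rss=,etime= >> memory_log.txt
  sleep 600
done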

The gcloud command I used is:

gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.0 \
  -- \
  --num_gpus=4

The config file (config.yml) is:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu

Any help appreciated, thanks.

Can you share the output of running python -c "from google.protobuf.internal import api_implementation; print(api_implementation._default_implementation_type)" locally? Is it 'cpp'? – rhaertel80
@rhaertel80 Yes, it is 'cpp'. – Chris
That matches the output in Cloud ML Engine. We'll continue investigating. – rhaertel80
Also, we recommend using github.com/tensorflow/models/pull/1538, which has huge performance benefits, possibly enough to get you through training while we investigate. – rhaertel80
Thanks @rhaertel80, that does seem to be much better on both memory usage and performance. – Chris

1 Answer

We recommend using this version of the code:

github.com/tensorflow/models/pull/1538

It has performance benefits: by running for less time, you're less prone to running out of memory.

That, of course, may not be a permanent fix. However, according to our testing, TensorFlow 1.2 appears to address the issue. TensorFlow 1.2 will be available soon on Cloud ML Engine. If you continue to have problems, please let us know.
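Once TensorFlow 1.2 is available on Cloud ML Engine, the job from the question can be resubmitted against it by changing only the runtime version flag, roughly as follows (a sketch based on the command in the question; the exact version string to pass is an assumption until the release is announced):

# Same submission as in the question, with only --runtime-version changed
gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.2 \
  -- \
  --num_gpus=4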