I've been trying the TensorFlow tutorial scripts on Google Cloud ML, in particular the cifar10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.
When I run this training script on Google Cloud ML, memory usage grows steadily by around 0.5% per hour.
I have not made any changes to the scripts other than packaging them into the required GCP format (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.
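For reference, the package layout is just the standard trainer structure from that guide, with the unmodified tutorial modules dropped in (roughly):

setup.py
trainer/
    __init__.py
    cifar10.py
    cifar10_input.py
    cifar10_multi_gpu_train.py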
If I run locally (i.e. not in Google Cloud) and use tcmalloc by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak goes away. However, that option is not available to me on Google Cloud ML.
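For comparison, the leak-free local run looks roughly like this (the libtcmalloc.so path varies by distribution and tcmalloc version):

LD_PRELOAD="/usr/lib/libtcmalloc.so" python cifar10_multi_gpu_train.py --num_gpus=4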
What could be causing the leak, and what can I do to fix it? Why aren't other users noticing the same problem? Although the leak is small, it is big enough to make my training sessions run out of memory and fail when I train on my own data for several days. The leak happens regardless of the number of GPUs I use.
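In case it matters how I'm measuring it: the ~0.5% per hour figure comes from watching the resident set size of the training process. A minimal sketch of that kind of check, using only the Python standard library (this is diagnostic code, not part of the tutorial scripts):

import resource

def log_memory(step):
    # ru_maxrss is the peak resident set size, in kilobytes on Linux;
    # for a steady leak it grows roughly linearly with wall-clock time.
    rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
    print('step %d: peak RSS %.1f MB' % (step, rss_mb))

Calling something like this every few hundred steps from the training loop is enough to see whether memory keeps growing.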
The gcloud command I used is:
gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.0 \
  -- --num_gpus=4
The config file (config.yml) is:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
Any help appreciated, thanks.
What do you get when you run
python -c "from google.protobuf.internal import api_implementation; print(api_implementation._default_implementation_type)"
locally? Is it 'cpp'? – rhaertel80