0
votes

I've been trying to run the cyclegan-1 model (https://github.com/leehomyc/cyclegan-1) on the provided horse2zebra dataset in order to test my tensorflow-gpu install.

Everything seems to work fine at first, until the end of the first batch, when my system freezes up for a minute and I get this error:

2017-10-26 15:23:10.103303: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-10-26 15:23:10.103321: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 8589934592
2017-10-26 15:23:10.103592: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-10-26 15:23:10.103599: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 7730940928
./run_cyclegan_oct_26_2017: line 1: 15025 Killed                  python3 -m CycleGAN_TensorFlow.main --to_train=2 --log_dir=CycleGAN_TensorFlow/output/cyclegan/exp_01 --config_filename=CycleGAN_TensorFlow/configs/exp_01.json --checkpoint_dir=CycleGAN_TensorFlow/output/cyclegan/exp_01/20171026-005834

I searched similar problems and thought that this was caused by tensorflow trying to allocate RAM used for system processes.

However, after killing the X server and running from a tty, I got the same error at exactly the same place: right after it finishes processing the first batch.

It seems like TensorFlow is trying to allocate around 8 GiB of pinned host memory, which is less than my total system memory.

Is the problem that I need to limit TensorFlow's memory usage? I've read a lot about limiting its GPU memory usage, but not its RAM usage.
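For reference, the GPU-side limiting I've read about looks like the sketch below (TensorFlow 1.x session config, which is what this codebase uses). Note this controls device memory, not the pinned host memory the error mentions, so I'm not sure it applies here:

```python
import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
# Allocate GPU memory incrementally instead of reserving it all at start-up.
config.gpu_options.allow_growth = True
# Alternatively, cap TensorFlow at a fixed fraction of GPU memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

sess = tf.Session(config=config)
```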

My setup:

  • Memory 15.6 GiB
  • Processor Intel Core i5-4440 CPU @ 3.10GHz x 4
  • Graphics GeForce GTX 1060 6GB/PCIe/SSE2
  • OS type 64-bit
  • Plenty of disk space
  • Using python3

Thanks!

Peter

I'm new to TensorFlow, but I'm fairly sure CUDA_ERROR_OUT_OF_MEMORY signals that your GPU is out of memory, not a reference to your RAM. Your graphics card has 6 GB of memory and you're trying to allocate 8.5 GB and 7.7 GB. – JoshVarty

A couple of other posts said that 'host' memory meant RAM, but I could be wrong. Not sure what I'd do in either case, though. – pelillian

After your first epoch, are you trying to check a validation dataset that's much larger than your mini-batches? – JoshVarty

I don't think so; I'm just trying to run the model without any changes. – pelillian

In my case I DID have a larger batch size on the validation dataset than on training; obvious in retrospect, but I needed the hint. – J B

1 Answer

0
votes

I found this question while searching for the cause of a similar problem: my program could train fine, but got CUDA_ERROR_OUT_OF_MEMORY when trying to save the model with tf.train.Saver.save. I'm posting my solution here for others' reference.

In my case, the cause was the per-process limit on open files. Because my data generator opened many files without closing them, the process hit the default limit (1024), and opening the model file for writing then failed. The problem was solved by raising the limit, or by closing all the files before saving. It took me a long while to figure this out because the error message has nothing to do with file I/O :(
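The fix above can be sketched with Python's standard resource module (Linux/macOS; the soft limit can only be raised up to the hard limit without root):

```python
import resource

# Inspect the current per-process limit on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# Raise the soft limit, staying within the hard limit.
if hard == resource.RLIM_INFINITY:
    new_soft = 4096
else:
    new_soft = min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```

(The equivalent shell command before launching training would be `ulimit -n 4096`.)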

I'm not sure whether tf.train_and_evaluate saves a checkpoint before evaluation. But one thing you can check is whether the output file (e.g. the model checkpoint) can be opened properly.
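A quick way to check that, assuming a hypothetical checkpoint path (substitute your own output directory): if the process has exhausted its file-descriptor limit, even a plain open() fails with "Too many open files".

```python
import os
import tempfile

# Hypothetical checkpoint path for illustration; use your model's output dir.
ckpt_path = os.path.join(tempfile.gettempdir(), "model.ckpt")
try:
    # Open for append so an existing checkpoint isn't truncated.
    with open(ckpt_path, "ab"):
        pass
    print("checkpoint path is writable")
except OSError as e:
    print("cannot open checkpoint for writing:", e)
```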