
I made my own implementation of AlexNet, with one fewer fully connected layer, to classify 102 classes of flowers. My training set consists of 11,000 images, while the validation and test sets have 3,000 images each. I wrote these three datasets to disk in HDF5 format, reloaded them, and tried to pass the images through the network in batches of 8 for 75 epochs. However, a memory error occurred.

I have already tried reducing the batch size to 8 and the image dimensions to 400x400 (the original is 500x500), but it did not help.
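For reference, this is roughly the shape of my pipeline (a minimal sketch; the file and dataset names are illustrative, not my exact train.py). The images sit in HDF5 files on disk and a generator yields them to fit_generator in batches of 8:

    import h5py
    import numpy as np

    def hdf5_generator(path, batch_size=8):
        """Yield (images, labels) batches from an HDF5 file, looping forever."""
        with h5py.File(path, "r") as f:
            images = f["images"]          # e.g. shape (num_samples, 400, 400, 3)
            labels = f["labels"]          # e.g. shape (num_samples, 102), one-hot
            num_samples = images.shape[0]
            while True:
                for start in range(0, num_samples, batch_size):
                    end = min(start + batch_size, num_samples)
                    # h5py reads only the requested slice from disk
                    x = np.asarray(images[start:end], dtype="float32") / 255.0
                    y = np.asarray(labels[start:end])
                    yield x, y

    # model is the AlexNet-style network described above (not shown here)
    # model.fit_generator(
    #     hdf5_generator("flowers_train.h5", batch_size=8),
    #     steps_per_epoch=11000 // 8,
    #     validation_data=hdf5_generator("flowers_val.h5", batch_size=8),
    #     validation_steps=3000 // 8,
    #     epochs=75,
    #     max_queue_size=8 * 2,
    #     verbose=1)

The full log from the failed run is below.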

tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-08-23 00:19:47.336560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:01:00.0 totalMemory: 4.00GiB freeMemory: 3.30GiB
2019-08-23 00:19:47.342432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-08-23 00:19:47.900540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-23 00:19:47.904687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-08-23 00:19:47.907033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-08-23 00:19:47.909380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3007 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-08-23 00:19:48.550001: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:49.089904: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:49.629533: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:50.067994: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:50.523258: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
Epoch 1/75
2019-08-23 00:20:14.632764: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-08-23 00:20:16.325917: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.410374: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 836.38MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.650565: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 429.27MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.716695: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.733003: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 637.52MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.782250: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 844.88MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.792756: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 429.27MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:25.135977: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 784.00MiB. Current allocation summary follows.
2019-08-23 00:20:25.143913: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 104, Chunks in use: 99. 26.0KiB allocated for chunks. 24.8KiB in use in bin. 452B client-requested in use in bin.
2019-08-23 00:20:25.150353: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512): Total Chunks: 16, Chunks in use: 14. 8.0KiB allocated for chunks. 7.0KiB in use in bin. 5.3KiB client-requested in use in bin.
2019-08-23 00:20:25.160812: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024): Total Chunks: 49, Chunks in use: 49. 61.3KiB allocated for chunks. 61.3KiB in use in bin. 60.1KiB client-requested in use in bin.
2019-08-23 00:20:25.169944: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048): Total Chunks: 4, Chunks in use: 4. 13.0KiB allocated for chunks. 13.0KiB in use in bin. 12.8KiB client-requested in use in bin.
2019-08-23 00:20:25.182025: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096): Total Chunks: 1, Chunks in use: 0. 6.3KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.192454: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192): Total Chunks: 1, Chunks in use: 0. 15.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.200847: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384): Total Chunks: 9, Chunks in use: 9. 144.8KiB allocated for chunks. 144.8KiB in use in bin. 144.0KiB client-requested in use in bin.
2019-08-23 00:20:25.209817: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.219192: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.228194: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072): Total Chunks: 9, Chunks in use: 9. 1.17MiB allocated for chunks. 1.17MiB in use in bin. 1.16MiB client-requested in use in bin.
2019-08-23 00:20:25.236088: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.245435: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.254114: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576): Total Chunks: 8, Chunks in use: 7. 12.25MiB allocated for chunks. 11.22MiB in use in bin. 10.91MiB client-requested in use in bin.
2019-08-23 00:20:25.264209: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152): Total Chunks: 14, Chunks in use: 14. 42.09MiB allocated for chunks. 42.09MiB in use in bin. 42.09MiB client-requested in use in bin.
2019-08-23 00:20:25.273799: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304): Total Chunks: 13, Chunks in use: 13. 80.41MiB allocated for chunks. 80.41MiB in use in bin. 77.91MiB client-requested in use in bin.
2019-08-23 00:20:25.285089: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608): Total Chunks: 13, Chunks in use: 13. 141.14MiB allocated for chunks. 141.14MiB in use in bin. 136.45MiB client-requested in use in bin.
2019-08-23 00:20:25.298520: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216): Total Chunks: 4, Chunks in use: 4. 112.98MiB allocated for chunks. 112.98MiB in use in bin. 112.98MiB client-requested in use in bin.
2019-08-23 00:20:25.306979: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432): Total Chunks: 4, Chunks in use: 4. 183.11MiB allocated for chunks. 183.11MiB in use in bin. 183.11MiB client-requested in use in bin.
2019-08-23 00:20:25.315121: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864): Total Chunks: 1, Chunks in use: 0. 82.18MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.322194: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.331550: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456): Total Chunks: 3, Chunks in use: 3. 2.30GiB allocated for chunks. 2.30GiB in use in bin. 2.30GiB client-requested in use in bin.
2019-08-23 00:20:25.342419: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 784.00MiB was 256.00MiB, Chunk State:
tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 2.87GiB
2019-08-23 00:20:50.049508: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:        3153697177
InUse:        3086482944
MaxInUse:     3153574400
NumAllocs:    388
MaxAllocSize: 822083584

2019-08-23 00:20:50.061236: W tensorflow/core/common_runtime/bfc_allocator.cc:271] **************************************************************************************************__
2019-08-23 00:20:50.066546: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[50176,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "train.py", line 80, in
    max_queue_size=8 * 2, verbose=1)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins)  # pylint: disable=not-callable
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in call
    run_metadata=self.run_metadata)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\client\session.py", line 1439, in call
    run_metadata_ptr)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in exit
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[50176,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training/RMSprop/gradients/loss/kernel/Regularizer_5/Square_grad/Mul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node ConstantFoldingCtrl/loss/activation_6_loss/broadcast_weights/assert_broadcastable/AssertGuard/Switch_0}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


1 Answer


This happens because there is not enough free GPU memory to allocate for training. One possible cause is loading the whole dataset into memory instead of feeding it in batches, but you used fit_generator, so we can rule that out: it generates data in parallel and feeds it to the model one batch at a time.
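To make that concrete, here is a rough back-of-the-envelope comparison (a sketch, assuming 400x400x3 float32 images as described in the question):

    # Loading the entire training set into memory at once:
    whole_set_bytes = 11_000 * 400 * 400 * 3 * 4      # float32 = 4 bytes per value
    print(f"whole training set: {whole_set_bytes / 1e9:.1f} GB")   # ~21.1 GB

    # What fit_generator materializes per step: a single batch of 8 images
    # (plus a small prefetch queue), which is tiny by comparison.
    one_batch_bytes = 8 * 400 * 400 * 3 * 4
    print(f"one batch of 8: {one_batch_bytes / 1e6:.1f} MB")       # ~15.4 MB

So the batches themselves are not what fills the 4 GB card.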

The solution is to check which process is using your GPU. If you have an NVIDIA GPU, you can list the processes consuming GPU memory with nvidia-smi; otherwise you can try ps -fA | grep python. This shows which processes are running and holding the GPU. Take the process ID from the PID column and kill the process with kill -9 PID. Then re-run the training; this time your GPU is free. I faced the same problem, and clearing the GPU worked for me.
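For example, on an NVIDIA card the check-and-kill workflow looks roughly like this (the PID is a placeholder; take the real one from the listing):

    # list the processes currently holding GPU memory
    nvidia-smi

    # or, alternatively, look for stray Python processes
    ps -fA | grep python

    # kill the offending process, using the PID from the listing above
    kill -9 <PID>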

  • Note: all commands should be run in a terminal.