I am trying to train my deep learning model using Keras with the TensorFlow backend on a remote server with a GPU. However, even on the GPU server I get an OOM (out-of-memory) error.
This was the output:
2018-02-09 14:19:28.918619: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats: Limit: 10658837300 InUse: 10314885120 MaxInUse: 10349312000 NumAllocs: 8762 MaxAllocSize: 1416551936
2018-02-09 14:19:28.918672: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ************__********************************************************************************xxxxxx
2018-02-09 14:19:28.918745: W tensorflow/core/framework/op_kernel.cc:1182] Resource exhausted: OOM when allocating tensor of shape [13772,13772] and type float
2018-02-09 14:19:29.294784: E tensorflow/core/common_runtime/executor.cc:643] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [13772,13772] and type float
[[Node: training_4/RMSprop/zeros = Const[dtype=DT_FLOAT, value=Tensor, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]
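From what I understand of the log, the failing allocation is a single [13772,13772] float32 tensor for an RMSprop slot variable, and each copy of it is already a sizeable fraction of the ~10.6 GB limit shown in the stats line. A quick back-of-the-envelope calculation (just plain Python arithmetic, no TensorFlow needed):

```python
# Rough size of the tensor TensorFlow failed to allocate:
# a [13772, 13772] matrix of float32 values (4 bytes each).
n = 13772
bytes_per_float32 = 4
tensor_bytes = n * n * bytes_per_float32

print(tensor_bytes)              # 758671936 bytes
print(tensor_bytes / 1024**3)    # ~0.71 GiB per copy
```

So each copy of this tensor is about 0.71 GiB, and RMSprop keeps an accumulator per weight tensor in addition to the weights and gradients themselves, which may explain why the allocator runs out near its limit.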
Is there any way to resolve this issue? I tried adjusting the batch size: it initially worked with a batch size of 100, but when I reduced it to 50 it showed this error. After that I tried batch size 100 again, but it displayed the same error.
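One thing I was considering, in case it is relevant: since by default TensorFlow grabs nearly all GPU memory up front, would enabling memory growth on the session help? This is only a sketch of what I believe the TF 1.x / Keras API for that looks like (I have not confirmed it fixes my case):

```python
import tensorflow as tf
from keras import backend as K

# Ask TensorFlow to allocate GPU memory on demand instead of
# reserving (almost) all of it when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```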
I also tried searching for how to suspend the training binary while running evaluation, but did not find much.
Would greatly appreciate your help in this! Thank you!!