1 vote

So I'm trying to use multiple GPUs with Keras. When I run training_utils.py with the example program (given as comments inside the training_utils.py code), I end up with a ResourceExhaustedError. nvidia-smi tells me that barely one of the four GPUs is working. Using a single GPU works fine for other programs.

  • TensorFlow 1.3.0
  • Keras 2.0.8
  • Ubuntu 16.04
  • CUDA/cuDNN 8.0/6.0

Question: Does anyone have any idea what's going on here?

Console output:

(...)

2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    parallel_model.fit(x, y, epochs=20, batch_size=256)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
    **self.session_kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
  [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
  [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
  File "test.py", line 19, in <module>
    parallel_model = multi_gpu_model(model, gpus=2)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
    dilation_rate=self.dilation_rate)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
    data_format=tf_data_format)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
    name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
    data_format=data_format, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
  [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
  [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

EDIT (Added example code):

import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

# Instantiate the base model on the CPU so its weights live in host memory.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicate the model on 4 GPUs; each replica processes a slice of every batch.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

# This fit call is distributed across the 4 GPUs.
parallel_model.fit(x, y, epochs=20, batch_size=128)
Either your model is too big for your devices, or you're using batches with too many elements. – Daniel Möller
You are right; when I use a smaller batch size it works. I'm using four GTX 1080 Ti cards, so I didn't think that running the example program would cause a memory problem. The example program was meant for 8 GPUs by default. But how much memory does this program occupy, and how do you calculate it? Is it really more than 11 GB x 4? – NorwegianClassic
I don't know how to calculate it... I think the best is the good old trial-and-error approach. – Daniel Möller
Ok, thanks anyway. Your answer solved my problem. – NorwegianClassic

1 Answer

2 votes

When you encounter an OOM/ResourceExhaustedError on the GPU, I believe reducing the batch size is the right option to try first.
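For example, with the script from the question, the only change needed is a smaller batch_size in the fit() call (32 is just an illustrative guess; the right value depends on your cards):

# Same training call as in the question, but with a smaller global batch.
# 32 is an assumption for illustration; halve it again if you still hit OOM.
parallel_model.fit(x, y, epochs=20, batch_size=32)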

Different GPUs may need different batch sizes, depending on how much GPU memory they have.
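As a rough sanity check (assuming float32 activations and ignoring weights, gradients and cuDNN workspace), you can estimate what just the tensor named in the error message costs:

# The allocation that failed had shape [128, 55, 55, 256] in float32.
batch, h, w, c = 128, 55, 55, 256
bytes_per_float = 4
size_mb = batch * h * w * c * bytes_per_float / (1024.0 ** 2)
print(size_mb)  # ~378 MB for this single activation

Xception keeps many activations of comparable size alive for backprop, so with 128 samples per replica it is plausible to run past 11 GB per card even before counting weights and optimizer state.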

I recently faced a similar kind of problem and tweaked a lot to run different kinds of experiments.

Here is the link to that question (some tricks are included there as well).

However, when you reduce the batch size you may find that your training becomes slower.
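Part of the slowdown (my reading of how multi_gpu_model behaves, not something stated in the linked question) is that each batch is split across the replicas, so the per-GPU batch shrinks along with the global one:

# With gpus=4, multi_gpu_model slices every batch across the replicas:
# a global batch_size of 32 means only 8 samples per GPU per step,
# which tends to under-utilize each device.
parallel_model.fit(x, y, epochs=20, batch_size=32)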