1
votes

Specs

Docker container running on a machine with 16 GB RAM, 1x GTX 1070 with 8 GB, Ubuntu 16.04.3 LTS. Keras is set to use the GPU.

What I want to do

I want to compute the convolutional output for a set of 79,726 RGB images (245x245) so I can then get predictions from a secondary model that is already trained. I am using the VGG16 model that comes with keras.applications.

Code

from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import ImageDataGenerator
import math

model = VGG16(include_top=False)  # convolutional base only, no dense top
tst_b_s = 200
test_batches = ImageDataGenerator().flow_from_directory(
    directory='test/',
    target_size=(245, 245),
    batch_size=tst_b_s,
    shuffle=False,
)
# steps must be an integer; ceil covers the final, partial batch
test_feats = model.predict_generator(
    test_batches,
    steps=math.ceil(test_batches.samples / tst_b_s),
    verbose=1,
)

Problem

predict_generator runs for a while, then throws:

ResourceExhaustedError: OOM when allocating tensor with shape[200,64,245,245] [[Node: block1_conv2/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](block1_conv1/Relu, block1_conv2/kernel/read)]] [[Node: block5_pool/MaxPool/_159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_127_block5_pool/MaxPool", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

Update: using a smaller batch size (10)

The prediction process still halts, but this time with an internal error:

InternalError: Dst tensor is not initialized. [[Node: block5_pool/MaxPool/_159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_127_block5_pool/MaxPool", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

There are no other processes using the GPU.

Thank you

It seems that you have set the batch size to 200. That's a really huge value. Could you try a smaller one? – Marcin Możejko
Are there any other processes running on the same GPU? What does nvidia-smi say about the allocated memory, e.g. when your application is not running? Do you see any processes in the output of nvidia-smi which you recognize? – Andre Holzner

3 Answers

0
votes

You are using too much GPU memory. Try using a smaller batch size or making sure no other processes are running on the same GPU.
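
For example, a minimal sketch, assuming the VGG16 model and ImageDataGenerator setup from the question; 32 is just an arbitrary smaller value:

import math
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import ImageDataGenerator

model = VGG16(include_top=False)

tst_b_s = 32  # much smaller than 200, so the per-layer activations fit in 8 GB
test_batches = ImageDataGenerator().flow_from_directory(
    directory='test/',
    target_size=(245, 245),
    batch_size=tst_b_s,
    shuffle=False,
)
test_feats = model.predict_generator(
    test_batches,
    steps=math.ceil(test_batches.samples / tst_b_s),
    verbose=1,
)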

0
votes

The first error is because your GPU memory cannot hold the buffers generated by your neural network at that batch size.

The second one is because Keras struggles to release TensorFlow sessions. You can release a session explicitly:

import tensorflow as tf
tf.keras.backend.clear_session()

You can also check which processes are using your GPU by running nvidia-smi in a shell. You'll see a process using the whole memory of your GPU. Then just "kill -9" that process and you'll be able to run your TensorFlow code again.
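
If you use standalone Keras (as in the question) rather than tf.keras, the equivalent call is keras.backend.clear_session(). A minimal sketch of clearing the old session and rebuilding the model afterwards:

from keras import backend as K
from keras.applications.vgg16 import VGG16

# Drop the previous TensorFlow graph and session so their GPU memory can be reclaimed
K.clear_session()

# Rebuild the model in the fresh session before predicting again
model = VGG16(include_top=False)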

0
votes

To the best of my knowledge, batch size does not affect the inference results. Hence you can use whatever smaller batch size your GPU can handle, and there is no need to worry that a smaller one will cause any problems.
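
If you want to convince yourself, here is a minimal sketch comparing features computed with two different batch sizes; the random array is just a hypothetical stand-in for real images:

import numpy as np
from keras.applications.vgg16 import VGG16

model = VGG16(include_top=False)

# Hypothetical stand-in for a small sample of real 245x245 RGB images
sample = np.random.rand(20, 245, 245, 3).astype('float32')

# Same features regardless of batch size, up to floating-point noise
feats_a = model.predict(sample, batch_size=20)
feats_b = model.predict(sample, batch_size=5)
print(np.allclose(feats_a, feats_b, atol=1e-5))  # expected: True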