
I am training an object detection model based on the pre-trained efficientdet_d2_coco17_tpu-32 model from the TF2 Object Detection Model Zoo. https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md

I changed pipeline.config as needed for this process (I have done it many times before, with eff_d1 or SSD models from the TF2 Object Detection Zoo).

I successfully trained this model with a batch size of 2 for 10K steps. But when I tried to train for 100K / 50K / 20K steps, I got an OOM error.

I can't understand why this might happen.

Training on GPU: Nvidia GeForce RTX 3070, Ubuntu 20.04, TF 2.4.1

Any ideas? Thank you.


1 Answer


I don't know exactly what you did during training, but here are the likely reasons. If you increase the batch size, an OOM error can occur. You should also check whether any other unwanted processes are using your GPU; use the command below and end the unwanted processes:

nvidia-smi

In a few cases the image size causes the out-of-memory issue: a higher resolution uses more memory. As an example of the calculation, for a

32×32 colour image and a batch of 32

the memory cost on the GPU is:

32 ⋅ 32 ⋅ 3 (size of the image in colour) ⋅ 32 (size of the batch) ⋅ 32 (bits for a float32) bits ≈ 3 Mbit
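As a quick sanity check, here is the same arithmetic as a minimal Python sketch (the 32×32×3 image, batch of 32 and float32 values are just the example numbers above):

# Back-of-the-envelope memory for one batch of input images (float32 values).
height, width, channels = 32, 32, 3
batch_size = 32
bits_per_float32 = 32

bits = height * width * channels * batch_size * bits_per_float32
print(f"{bits / 1e6:.1f} Mbit (~{bits / 8 / 2**20:.2f} MiB)")  # ~3.1 Mbit, ~0.38 MiB

In practice the network's activations and gradients dominate the memory use, so the real per-batch cost is far higher, but it scales with image size and batch size in the same way.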

You can also set this environment variable (for example, export it in the shell before launching training) so TensorFlow allocates GPU memory on demand instead of reserving it all up front, which helps avoid OOM errors:

TF_FORCE_GPU_ALLOW_GROWTH=true
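If you prefer to set it from Python, a minimal sketch (the variable must be set before TensorFlow touches the GPU, so do it before the import):

import os

# Ask TensorFlow to grow GPU memory on demand instead of pre-allocating it.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf  # import only after the variable is set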

First option:

The code below is the TF1-style session configuration (made runnable on TF2 through tf.compat.v1) that corresponds to TF2's first option: letting GPU memory grow as needed.

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
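On TF 2.x you can get the same behaviour directly with the tf.config API; a minimal sketch (run it before any GPU work starts):

import tensorflow as tf

# Enable memory growth on every visible GPU so TensorFlow allocates
# memory on demand instead of reserving the whole card up front.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)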

Second option:

The code below corresponds to TF2's second option, but it sets a memory fraction rather than a fixed limit.

# Change the memory fraction as you need.
import tensorflow as tf

gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.3)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
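The native TF 2.x equivalent is to create a virtual GPU device with a fixed memory limit; a minimal sketch (the 4096 MB limit is only an example value, and it must run before the GPU is initialized):

import tensorflow as tf

# Cap TensorFlow at a fixed amount of GPU memory instead of a fraction.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])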

Finally, check the shuffle settings in pipeline.config: a shuffle buffer that is too large for your machine can also cause an OOM error, so define the shuffle buffer size according to your hardware, as in the sketch below.
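For illustration, a hypothetical train_input_reader block from pipeline.config; the paths and the value 256 are placeholders, and shuffle_buffer_size is the field to lower (its default in the Object Detection API is 2048) when memory is tight:

train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"   # placeholder path
  shuffle_buffer_size: 256                        # lower than the default to save memory
  tf_record_input_reader {
    input_path: "annotations/train.record"        # placeholder path
  }
}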