I am doing research on semantic segmentation architectures. I need to speed up my training, but I don't know where to look next.
General information
- images of shape (512,512,3)
- 4 GeForce GTX 1080 GPUs with 11 GB of memory each are available
- 1 Intel(R) Xeon(R) E5-2637 v4 CPU @ 3.50GHz is available
- enough RAM
- I use Keras
- I use light data preprocessing (mainly cropping, not much data augmentation)
I have tried different approaches to data loading, but each time the bottleneck seems to be the CPU rather than the GPU. I run nvidia-smi and htop to monitor utilization.
What I have tried so far:
Keras + custom DataGenerator with 8 workers and 1 GPU
model.fit_generator(generator=training_generator,use_multiprocessing=True, workers=8)
Keras + tf.data.Dataset with data loaded from raw images
model.fit(training_dataset.make_one_shot_iterator(),...)
I tried both ways of prefetching:
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0'))
Keras + tf.data.Dataset with data loaded from TFRecords
=> This option is next up.
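To prepare for that, writing the TFRecords file could look roughly like this (a sketch for TF 1.x; write_tfrecords, out_path, and the feature names are my own hypothetical choices, and the PNG bytes are stored still encoded so decoding happens later in tf.data):

import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_tfrecords(image_filenames, label_filenames, out_path):
    # Store the already-encoded PNG bytes; tf.data decodes them at read time.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        for img_path, lbl_path in zip(image_filenames, label_filenames):
            with open(img_path, 'rb') as f:
                img_bytes = f.read()
            with open(lbl_path, 'rb') as f:
                lbl_bytes = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': _bytes_feature(img_bytes),
                'label': _bytes_feature(lbl_bytes),
            }))
            writer.write(example.SerializeToString())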
Findings
- Using multiple GPUs (which is quite easy to do with Keras; see the sketch after this list) slows down the training, because the overhead computations occupy the CPU.
- Surprisingly, the plain DataGenerator approach (no tf.data.Dataset) is currently the fastest.
- GPU utilization briefly spikes to 100% with every approach, but it also drops to 0% at times.
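For reference, the multi-GPU setup I mean is the standard Keras one. A minimal sketch (build_model is a hypothetical stand-in for my actual architectures):

from keras.utils import multi_gpu_model

model = build_model()  # hypothetical: builds one of the two architectures
# Replicates the model on 4 GPUs. Each batch is split into 4 sub-batches,
# run in parallel, and the results are merged back on the CPU, which is
# exactly the kind of extra CPU work described above.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')
parallel_model.fit_generator(generator=training_generator,
                             use_multiprocessing=True, workers=8)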
I feel like my processing chain currently looks like this:
data on disk -> CPU loads data in RAM -> CPU does data preprocessing -> CPU moves data to GPU -> GPU does training step
Thus the only way I see to speed up the training is to do all preprocessing up front and save the results to disk (they will be huge with data augmentation), then use TFRecords to load the files efficiently.
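Reading such a file back would then replace from_tensor_slices plus the tf.read_file calls. A sketch matching the hypothetical writer above ('train.tfrecords' is a placeholder path; if preprocessing were really done up front, the precomputed tensors would be stored instead of PNG bytes):

def parse_tfrecord(serialized_example):
    features = tf.parse_single_example(serialized_example, features={
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.string),
    })
    x = tf.image.decode_png(features['image'])
    x = tf.image.convert_image_dtype(x, tf.float32)
    y = tf.image.decode_png(features['label'])
    return x, y

dataset = tf.data.TFRecordDataset('train.tfrecords')
dataset = dataset.map(parse_tfrecord, num_parallel_calls=4)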
Do you have other ideas on how to improve the training speed?
Update
I have tested my pipeline with two models:
- Simple model
- Complex model
Performance results
I trained 2 models for 3 epochs with 140 steps each (batch size = 3). Here are the results.
Raw image data => Keras.DataGenerator
- simple model: 126s
- complex model: 154s

Raw image data => tf.data.Dataset
- simple model: 208s
- complex model: 215s
DataGenerator
Helper function
def load_image(self, path):
    # Read the image as-is (flag -1) and convert from BGR to RGB.
    image = cv2.cvtColor(cv2.imread(path, -1), cv2.COLOR_BGR2RGB)
    return image
Main part
# Collect a batch of images on the CPU step by step
# (probably the bottleneck of the whole computation)
for i in range(len(image_filenames_tmp)):
    input_image = self.load_image(image_filenames_tmp[i])[: self.shape[0], : self.shape[1]]
    output_image = self.load_image(label_filenames_tmp[i])[: self.shape[0], : self.shape[1]]

    # Prep the data. Make sure the labels are in one-hot format.
    input_image = np.float32(input_image) / 255.0
    output_image = np.float32(self.one_hot_it(label=output_image, label_values=label_values))

    input_image_batch.append(np.expand_dims(input_image, axis=0))
    output_image_batch.append(np.expand_dims(output_image, axis=0))

input_image_batch = np.squeeze(np.stack(input_image_batch, axis=1))
output_image_batch = np.squeeze(np.stack(output_image_batch, axis=1))

return input_image_batch, output_image_batch
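For context, this main part lives in __getitem__ of a keras.utils.Sequence subclass, which is what makes use_multiprocessing=True with 8 workers safe. Roughly (the attribute names besides self.shape are hypothetical):

from keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, image_filenames, label_filenames, batch_size, shape):
        self.image_filenames = image_filenames
        self.label_filenames = label_filenames
        self.batch_size = batch_size
        self.shape = shape

    def __len__(self):
        # Number of batches per epoch.
        return len(self.image_filenames) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        image_filenames_tmp = self.image_filenames[start:start + self.batch_size]
        label_filenames_tmp = self.label_filenames[start:start + self.batch_size]
        # ... followed by the batch-building loop shown above ...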
tf.data.Dataset
Helper function
def preprocess_fn(train_image_filename, train_label_filename):
    '''A transformation function to preprocess raw data
    into trainable input.'''
    x = tf.image.decode_png(tf.read_file(train_image_filename))
    x = tf.image.convert_image_dtype(x, tf.float32, saturate=False, name=None)
    x = tf.image.resize_image_with_crop_or_pad(x, 512, 512)

    y = tf.image.decode_png(tf.read_file(train_label_filename))
    y = tf.image.resize_image_with_crop_or_pad(y, 512, 512)

    class_names, label_values = get_label_info(csv_path)

    # Build a boolean map per class colour, then stack them into a one-hot map.
    semantic_map = []
    for colour in label_values:
        class_map = tf.reduce_all(tf.equal(y, colour), axis=-1)
        semantic_map.append(class_map)
    semantic_map = tf.stack(semantic_map, axis=-1)

    # NOTE: cast to tf.float32 because most neural networks operate in float32.
    semantic_map = tf.cast(semantic_map, tf.float32)

    return x, semantic_map
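One detail worth flagging: get_label_info(csv_path) does Python-side CSV I/O inside the map function. In TF 1.x graph mode it only runs once when the map is traced, but hoisting it out makes that explicit and keeps the per-element function free of file access. A sketch with the same logic, only the lookup moved:

# Read the class-colour map once, at pipeline-construction time.
class_names, label_values = get_label_info(csv_path)

def preprocess_fn(train_image_filename, train_label_filename):
    x = tf.image.decode_png(tf.read_file(train_image_filename))
    x = tf.image.convert_image_dtype(x, tf.float32)
    x = tf.image.resize_image_with_crop_or_pad(x, 512, 512)
    y = tf.image.decode_png(tf.read_file(train_label_filename))
    y = tf.image.resize_image_with_crop_or_pad(y, 512, 512)
    # label_values is captured by closure instead of being re-read here.
    semantic_map = tf.stack([tf.reduce_all(tf.equal(y, colour), axis=-1)
                             for colour in label_values], axis=-1)
    return x, tf.cast(semantic_map, tf.float32)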
Main part
dataset = tf.data.Dataset.from_tensor_slices((train_image_filenames, train_label_filenames))
dataset = dataset.apply(tf.contrib.data.map_and_batch(
    preprocess_fn, batch_size,
    num_parallel_batches=4,  # number of CPU cores
    drop_remainder=True if is_training else False))
dataset = dataset.repeat()
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)  # automatically picks the best buffer_size
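Wiring this into Keras then matches the fit call from further up; steps_per_epoch is required because the dataset repeats indefinitely (140 is the step count from my timing runs):

iterator = dataset.make_one_shot_iterator()
model.fit(iterator,
          steps_per_epoch=140,  # needed: dataset.repeat() never ends an epoch
          epochs=3)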