11 votes

I am doing research with semantic segmentation architectures. I need to speed up my training but don't know where to look further.

General information

  • images of shape (512,512,3)
  • 4 GeForce GTX 1080 GPUs with 11 GB of memory each are available
  • 1 Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz is available
  • enough RAM
  • I use Keras
  • I use light data preprocessing (mainly cropping, not much data augmentation)

I have tried different approaches to data loading, but every time the bottleneck seems to be the CPU instead of the GPU. I run nvidia-smi and htop to monitor utilization.

What I have tried so far:

  • Keras + custom DataGenerator with 8 workers and 1 GPU: model.fit_generator(generator=training_generator, use_multiprocessing=True, workers=8)

  • Keras + tf.data.Dataset with data loaded from raw images: model.fit(training_dataset.make_one_shot_iterator(), ...)

    I tried both ways of prefetching:
    dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
    dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0'))

  • Keras + tf.data.Dataset with data loaded from TFRecords
    => This option is next up.

Findings

  • Using multiple GPUs (which is quite easy to do with Keras) actually slows down training, because the overhead of distributing the data keeps the CPU busy.
  • Surprisingly, the plain DataGenerator approach (no tf.data.Dataset) is the fastest right now.
  • With every approach, GPU utilization briefly spikes to 100%, but it also drops to 0% at times.

I feel like right now, my processing chain looks like this:

data on disk -> CPU loads data in RAM -> CPU does data preprocessing -> CPU moves data to GPU -> GPU does training step

Thus the only way to speed up the training seems to be to do all preprocessing up front and save the results to disk (which will be huge with data augmentation), and then use TFRecords to load the files efficiently.
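
Roughly what I have in mind for that TFRecord option (just a sketch using the TF 1.x API; the feature keys, n_classes, and file names below are placeholders, not code I have run yet):

import numpy as np
import tensorflow as tf

n_classes = 32    # placeholder: number of segmentation classes
batch_size = 3

# One-time preprocessing: write (image, one-hot label) pairs to a TFRecord file.
def write_tfrecords(examples, out_path):
    with tf.python_io.TFRecordWriter(out_path) as writer:
        for image, label in examples:   # float32 arrays of shape (512,512,3) and (512,512,n_classes)
            feature = {
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
                'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label.tobytes()])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())

# Training time: read the already-preprocessed tensors back.
def parse_fn(serialized):
    parsed = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.string),
    })
    image = tf.reshape(tf.decode_raw(parsed['image'], tf.float32), (512, 512, 3))
    label = tf.reshape(tf.decode_raw(parsed['label'], tf.float32), (512, 512, n_classes))
    return image, label

dataset = tf.data.TFRecordDataset(['train.tfrecords'])
dataset = dataset.map(parse_fn, num_parallel_calls=4).batch(batch_size).repeat()
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)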

Do you have any other ideas for how to improve the training speed?

Update

I have tested my pipeline with two models.

Simple model

[architecture diagram]

Complex model

[architecture diagram]

Performance results

I trained 2 models for 3 epochs with 140 steps each (batch size = 3). Here are the results.

  1. Raw image data => Keras DataGenerator
    simple model: 126s
    complex model: 154s

  2. Raw image data => tf.data.Dataset
    simple model: 208s
    complex model: 215s

DataGenerator

Helper function

def load_image(self, path):
    # OpenCV reads images as BGR; convert to RGB before feeding the network
    image = cv2.cvtColor(cv2.imread(path, -1), cv2.COLOR_BGR2RGB)
    return image

Main part

# Collect a batch of images on the CPU step by step (probably the bottleneck of the whole computation)
for i in range(len(image_filenames_tmp)):
    input_image = self.load_image(image_filenames_tmp[i])[: self.shape[0], : self.shape[1]]
    output_image = self.load_image(label_filenames_tmp[i])[: self.shape[0], : self.shape[1]]

    # Prep the data. Make sure the labels are in one-hot format
    input_image = np.float32(input_image) / 255.0
    output_image = np.float32(self.one_hot_it(label=output_image, label_values=label_values))

    input_image_batch.append(np.expand_dims(input_image, axis=0))
    output_image_batch.append(np.expand_dims(output_image, axis=0))

# Stack the collected images into (batch, height, width, channels) arrays after the loop
input_image_batch = np.squeeze(np.stack(input_image_batch, axis=1))
output_image_batch = np.squeeze(np.stack(output_image_batch, axis=1))

return input_image_batch, output_image_batch

tf.data.Dataset

Helper function

def preprocess_fn(train_image_filename, train_label_filename):
    '''A transformation function to preprocess raw data into trainable input.'''
    x = tf.image.decode_png(tf.read_file(train_image_filename))
    x = tf.image.convert_image_dtype(x, tf.float32, saturate=False, name=None)
    x = tf.image.resize_image_with_crop_or_pad(x, 512, 512)

    y = tf.image.decode_png(tf.read_file(train_label_filename))
    y = tf.image.resize_image_with_crop_or_pad(y, 512, 512)

    class_names, label_values = get_label_info(csv_path)

    # Build the one-hot semantic map: one boolean channel per class colour
    semantic_map = []
    for colour in label_values:
        class_map = tf.reduce_all(tf.equal(y, colour), axis=-1)
        semantic_map.append(class_map)
    semantic_map = tf.stack(semantic_map, axis=-1)
    # NOTE cast to tf.float32 because most neural networks operate in float32.
    semantic_map = tf.cast(semantic_map, tf.float32)

    return x, semantic_map


Main part

dataset = tf.data.Dataset.from_tensor_slices((train_image_filenames, train_label_filenames))

dataset = dataset.apply(tf.contrib.data.map_and_batch(
            preprocess_fn, batch_size,
            num_parallel_batches=4,  # cpu cores
            drop_remainder=True if is_training else False))
dataset = dataset.repeat()
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)  # automatically picks best buffer_size
    

3 Answers

3 votes

I am dealing with similar issues, and trying to optimize the pipeline is an uphill battle. Using Horovod instead of Keras multi-GPU gives me an almost linear speed-up, whereas Keras multi-GPU didn't: https://medium.com/omnius/keras-horovod-distributed-deep-learning-on-steroids-94666e16673d
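
Roughly the setup I mean (the standard horovod.keras pattern; the stand-in model is a placeholder, training_generator is your own generator, and the step/epoch numbers just mirror your question):

import keras
import tensorflow as tf
import horovod.keras as hvd
from keras import backend as K

hvd.init()

# Pin each Horovod process to one GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Stand-in model; replace with your segmentation network.
model = keras.models.Sequential([
    keras.layers.Conv2D(32, 3, padding='same', activation='softmax',
                        input_shape=(512, 512, 3))
])

# Wrap the optimizer; scale the learning rate by the number of workers.
opt = hvd.DistributedOptimizer(keras.optimizers.Adam(lr=1e-3 * hvd.size()))
model.compile(optimizer=opt, loss='categorical_crossentropy')

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit_generator(training_generator,
                    steps_per_epoch=140 // hvd.size(),
                    epochs=3,
                    callbacks=callbacks,
                    verbose=1 if hvd.rank() == 0 else 0)

You then launch one process per GPU, e.g. with horovodrun -np 4 python train.py.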

tf.data is definitely the way to go. You might also want to add a shuffle operation for better generalization.
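
For example, before batching (the buffer size is something you would tune to your memory budget):

# Shuffle filenames each epoch; a buffer as large as the dataset gives a full shuffle.
dataset = dataset.shuffle(buffer_size=len(train_image_filenames), reshuffle_each_iteration=True)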

Another thing that improved things a lot for me was resizing the images beforehand and saving them with np.save() as .npy files. They take more space on disk, but reading them is an order of magnitude faster. I used tf.py_func() to wrap my NumPy operations as tensor ops (these can't be parallelized because of the Python GIL).
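
Roughly what I mean, as a sketch (the path lists, shapes, and the float32 dtype are assumptions about how you saved the arrays):

import numpy as np
import tensorflow as tf

def load_npy_pair(image_path, label_path):
    # Plain NumPy loading; runs under the GIL, so keep it lightweight.
    return np.load(image_path.decode()), np.load(label_path.decode())

def tf_load(image_path, label_path):
    image, label = tf.py_func(load_npy_pair, [image_path, label_path],
                              (tf.float32, tf.float32))
    image.set_shape((512, 512, 3))    # static shapes are lost through py_func, so restore them
    label.set_shape((512, 512, None))
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((image_npy_paths, label_npy_paths))
dataset = dataset.map(tf_load, num_parallel_calls=4).batch(3).prefetch(1)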

Nvidia recently released DALI. It does augmentation on the GPU, which is definitely the way to go in the future. For a simple classification task it might already have all the functionality you need.

3 votes

What does your data processing pipeline look like exactly? Have you considered omitting some steps that might be too expensive? How is your data stored? Are the plain image files loaded on demand, or do you pre-load them into memory? Loading JPG/PNG images is usually very expensive.

Do you see any improvement if you increase max_queue_size in model.fit_generator()?
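
For example (the same call as in your question, just with the queue size raised; 20 is an arbitrary starting point):

model.fit_generator(generator=training_generator,
                    steps_per_epoch=140,
                    epochs=3,
                    max_queue_size=20,   # default is 10; a larger queue can smooth out CPU spikes
                    workers=8,
                    use_multiprocessing=True)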

And finally, could you benchmark how fast your data processing pipeline actually is, for example by generating a few thousand batches and measuring the time per batch?
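
Something as simple as this (pure Python timing of the generator alone, no GPU involved; it assumes your DataGenerator is a keras.utils.Sequence) would already tell you a lot:

import time

n_batches = 200  # arbitrary sample size
start = time.perf_counter()
for i in range(n_batches):
    x_batch, y_batch = training_generator[i % len(training_generator)]
elapsed = time.perf_counter() - start
print('%.3f s per batch on average' % (elapsed / n_batches))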

Apart from this, my own experience is that low GPU utilization can be observed when your model is relatively small / not computationally expensive. Since new data has to be fed to the GPU between batches, there is an overhead you can't really avoid. When the ratio between this overhead and the actual computation time for a single pass is high, you might observe that overall GPU utilization is relatively low and sometimes even reads 0%.

Edit: Could you give us more information about the model you use, especially what kind of layers it mostly consists of? The computation time for a single pass of a relatively small CNN, for example, might be so short that more time is spent refeeding the GPU between batches than on the actual computation.

Update: After you added more information about your processing pipeline, I would say that your main bottleneck is the loading and decoding of the PNG images. PNG decompression (and compression even more so) is usually very expensive (according to this source, about 5 times more than JPEG). To check this assumption, you could profile your processing pipeline by measuring how much time every processing step (decoding, resizing, cropping, etc.) needs and which one is the main contributor.

Now there are many ways to optimize your processing pipeline:

  • It seems like you load plain, unprocessed PNG images of varying sizes. You could at least resize each image file to its final size up front. That would save storage and should decrease the loading/decoding overhead.
  • Use JPEG instead. The quality difference between JPEG and PNG is minimal for "real world" images, but JPEG takes less space and decoding is cheaper.
  • If you have enough storage available, you can save whole batches of images as compressed NumPy arrays in their final format. This may use more space but decreases the loading time drastically.
1 vote

You are correct about the processing chain.

What can result in a great performance increase, in my experience, is parallelizing the data loading (if it comes from a remote database, for example) as well as the data preprocessing.

This way you can keep preparing the data for the next batch while the current one trains, and ideally the processed data for the next batch is ready as soon as the last training step finishes on the GPU.

If your preprocessing is very heavy compared to a very fast training step, this might not increase performance by much, though. Then I would say your best bet is to move the preprocessing to the GPU as well, e.g. by using CUDA.

EDIT: Should that not help, I would suggest more in-depth profiling. If it really is some processing step, think about how to speed it up, or check whether it is a simple issue such as Python lists being used instead of NumPy arrays for array manipulation. In the end, your only remaining option would be to save the pre-processed data instead of computing it at runtime. An alternative is to cache it after the first processing pass (depending on how much RAM you have).
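
If you go the tf.data route, caching after the deterministic preprocessing is a one-liner. A sketch, reusing the preprocess_fn from the question (buffer sizes are arbitrary, and this assumes your TF version already has Dataset.cache):

dataset = dataset.map(preprocess_fn, num_parallel_calls=4)
dataset = dataset.cache()                      # in-memory; pass a filename to spill to disk instead
dataset = dataset.shuffle(buffer_size=1000).batch(3).repeat()
dataset = dataset.prefetch(1)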