11 votes

I am doing research with semantic segmentation architectures. I need to speed up my training but don't know where to look further.

General information

  • images of shape (512,512,3)
  • 4 GeForce GTX 1080 GPUs with 11 GB of memory each are available
  • 1 Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz is available
  • enough RAM
  • I use Keras
  • I use light data preprocessing (mainly cropping, not much data augmentation)

I have tried different approaches to data loading, but every time the bottleneck seems to be the CPU instead of the GPU. I run nvidia-smi and htop to monitor utilization.

What I have tried so far:

  • Keras + custom DataGenerator with 8 workers and 1 GPU: model.fit_generator(generator=training_generator, use_multiprocessing=True, workers=8)

  • Keras + tf.data.Dataset with data loaded from raw images: model.fit(training_dataset.make_one_shot_iterator(), ...)

    I tried both ways of prefetching:
    dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
    dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0'))

  • Keras + tf.data.Dataset with data loaded from TFRecords
    => This option is next up.

Findings

  • Using multiple GPUs (which is quite easy to do with Keras) actually slows down training, because the overhead of distributing the data keeps the CPU busy.
  • Surprisingly, the plain DataGenerator approach (no tf.data.Dataset) is the fastest right now.
  • With every approach, GPU utilization briefly spikes to 100%, but it also drops to 0% at times.

I feel like right now, my processing chain looks like this:

data on disk -> CPU loads data in RAM -> CPU does data preprocessing -> CPU moves data to GPU -> GPU does training step

Thus the only way to speed up the training seems to be to do all preprocessing up front and save the results to disk (which will be huge with data augmentation), and then use TFRecords to load the files efficiently.
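
Roughly what I have in mind for that TFRecord option (just a sketch using the TF 1.x API; the feature keys, n_classes, and file names below are placeholders, not code I have run yet):

import numpy as np
import tensorflow as tf

n_classes = 32    # placeholder: number of segmentation classes
batch_size = 3

# One-time preprocessing: write (image, one-hot label) pairs to a TFRecord file.
def write_tfrecords(examples, out_path):
    with tf.python_io.TFRecordWriter(out_path) as writer:
        for image, label in examples:   # float32 arrays of shape (512,512,3) and (512,512,n_classes)
            feature = {
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
                'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label.tobytes()])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())

# Training time: read the already-preprocessed tensors back.
def parse_fn(serialized):
    parsed = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.string),
    })
    image = tf.reshape(tf.decode_raw(parsed['image'], tf.float32), (512, 512, 3))
    label = tf.reshape(tf.decode_raw(parsed['label'], tf.float32), (512, 512, n_classes))
    return image, label

dataset = tf.data.TFRecordDataset(['train.tfrecords'])
dataset = dataset.map(parse_fn, num_parallel_calls=4).batch(batch_size).repeat()
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)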

Do you have any other ideas for how to improve the training speed?

Update

I have tested my pipeline with two models.

Simple model

[architecture diagram]

Complex model

[architecture diagram]

Performance results

I trained 2 models for 3 epochs with 140 steps each (batch size = 3). Here are the results.

  1. Raw image data => Keras DataGenerator
    simple model: 126s
    complex model: 154s

  2. Raw image data => tf.data.Dataset
    simple model: 208s
    complex model: 215s

DataGenerator

Helper function

def load_image(self, path):
    # OpenCV reads images as BGR; convert to RGB before feeding the network
    image = cv2.cvtColor(cv2.imread(path, -1), cv2.COLOR_BGR2RGB)
    return image

Main part

# Collect a batch of images on the CPU step by step (probably the bottleneck of the whole computation)
for i in range(len(image_filenames_tmp)):
    input_image = self.load_image(image_filenames_tmp[i])[: self.shape[0], : self.shape[1]]
    output_image = self.load_image(label_filenames_tmp[i])[: self.shape[0], : self.shape[1]]

    # Prep the data. Make sure the labels are in one-hot format
    input_image = np.float32(input_image) / 255.0
    output_image = np.float32(self.one_hot_it(label=output_image, label_values=label_values))

    input_image_batch.append(np.expand_dims(input_image, axis=0))
    output_image_batch.append(np.expand_dims(output_image, axis=0))

# Stack the collected images into (batch, height, width, channels) arrays after the loop
input_image_batch = np.squeeze(np.stack(input_image_batch, axis=1))
output_image_batch = np.squeeze(np.stack(output_image_batch, axis=1))

return input_image_batch, output_image_batch

tf.data.Dataset

Helper function

def preprocess_fn(train_image_filename, train_label_filename):
    '''A transformation function to preprocess raw data into trainable input.'''
    x = tf.image.decode_png(tf.read_file(train_image_filename))
    x = tf.image.convert_image_dtype(x, tf.float32, saturate=False, name=None)
    x = tf.image.resize_image_with_crop_or_pad(x, 512, 512)

    y = tf.image.decode_png(tf.read_file(train_label_filename))
    y = tf.image.resize_image_with_crop_or_pad(y, 512, 512)

    class_names, label_values = get_label_info(csv_path)

    # Build the one-hot semantic map: one boolean channel per class colour
    semantic_map = []
    for colour in label_values:
        class_map = tf.reduce_all(tf.equal(y, colour), axis=-1)
        semantic_map.append(class_map)
    semantic_map = tf.stack(semantic_map, axis=-1)
    # NOTE cast to tf.float32 because most neural networks operate in float32.
    semantic_map = tf.cast(semantic_map, tf.float32)

    return x, semantic_map


Main part

dataset = tf.data.Dataset.from_tensor_slices((train_image_filenames, train_label_filenames))

dataset = dataset.apply(tf.contrib.data.map_and_batch(
            preprocess_fn, batch_size,
            num_parallel_batches=4,  # cpu cores
            drop_remainder=True if is_training else False))
dataset = dataset.repeat()
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)  # automatically picks best buffer_size
    

3 Answers

3 votes

I am dealing with similar issues, and trying to optimize the pipeline is an uphill battle. Using Horovod instead of Keras multi-GPU gives me an almost linear speed-up, whereas Keras multi-GPU didn't: https://medium.com/omnius/keras-horovod-distributed-deep-learning-on-steroids-94666e16673d
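
Roughly the setup I mean (the standard horovod.keras pattern; the stand-in model is a placeholder, training_generator is your own generator, and the step/epoch numbers just mirror your question):

import keras
import tensorflow as tf
import horovod.keras as hvd
from keras import backend as K

hvd.init()

# Pin each Horovod process to one GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Stand-in model; replace with your segmentation network.
model = keras.models.Sequential([
    keras.layers.Conv2D(32, 3, padding='same', activation='softmax',
                        input_shape=(512, 512, 3))
])

# Wrap the optimizer; scale the learning rate by the number of workers.
opt = hvd.DistributedOptimizer(keras.optimizers.Adam(lr=1e-3 * hvd.size()))
model.compile(optimizer=opt, loss='categorical_crossentropy')

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit_generator(training_generator,
                    steps_per_epoch=140 // hvd.size(),
                    epochs=3,
                    callbacks=callbacks,
                    verbose=1 if hvd.rank() == 0 else 0)

You then launch one process per GPU, e.g. with horovodrun -np 4 python train.py.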

tf.data is definitely the way to go. You might also want to add a shuffle operation for better generalization.
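
For example, before batching (the buffer size is something you would tune to your memory budget):

# Shuffle filenames each epoch; a buffer as large as the dataset gives a full shuffle.
dataset = dataset.shuffle(buffer_size=len(train_image_filenames), reshuffle_each_iteration=True)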

Another thing that improved things a lot for me was resizing the images beforehand and saving them with np.save() as .npy files. They take more space on disk, but reading them is an order of magnitude faster. I used tf.py_func() to wrap my NumPy operations as tensor ops (these can't be parallelized because of the Python GIL).
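
Roughly what I mean, as a sketch (the path lists, shapes, and the float32 dtype are assumptions about how you saved the arrays):

import numpy as np
import tensorflow as tf

def load_npy_pair(image_path, label_path):
    # Plain NumPy loading; runs under the GIL, so keep it lightweight.
    return np.load(image_path.decode()), np.load(label_path.decode())

def tf_load(image_path, label_path):
    image, label = tf.py_func(load_npy_pair, [image_path, label_path],
                              (tf.float32, tf.float32))
    image.set_shape((512, 512, 3))    # static shapes are lost through py_func, so restore them
    label.set_shape((512, 512, None))
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((image_npy_paths, label_npy_paths))
dataset = dataset.map(tf_load, num_parallel_calls=4).batch(3).prefetch(1)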

Nvidia recently released DALI. It does augmentation on the GPU, which is definitely the way to go in the future. For a simple classification task it might already have all the functionality you need.

3 votes

What does your data processing pipeline look like exactly? Have you considered omitting some steps that might be too expensive? How is your data stored? Are the plain image files loaded on demand, or do you pre-load them into memory? Loading JPG/PNG images is usually very expensive.

Do you see any improvement if you increase max_queue_size in model.fit_generator()?
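
For example (the same call as in your question, just with the queue size raised; 20 is an arbitrary starting point):

model.fit_generator(generator=training_generator,
                    steps_per_epoch=140,
                    epochs=3,
                    max_queue_size=20,   # default is 10; a larger queue can smooth out CPU spikes
                    workers=8,
                    use_multiprocessing=True)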

And finally, could you benchmark how fast your data processing pipeline actually is, for example by generating a few thousand batches and measuring the time per batch?
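
Something as simple as this (pure Python timing of the generator alone, no GPU involved; it assumes your DataGenerator is a keras.utils.Sequence) would already tell you a lot:

import time

n_batches = 200  # arbitrary sample size
start = time.perf_counter()
for i in range(n_batches):
    x_batch, y_batch = training_generator[i % len(training_generator)]
elapsed = time.perf_counter() - start
print('%.3f s per batch on average' % (elapsed / n_batches))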

Apart from this, my own experience is that low GPU utilization can be observed when your model is relatively small / not computationally expensive. Since new data has to be fed to the GPU between batches, there is an overhead you can't really avoid. When the ratio between this overhead and the actual computation time for a single pass is high, you might observe that overall GPU utilization is relatively low and sometimes even reads 0%.

Edit: Could you give us more information about the model you use, especially what kind of layers it mostly consists of? The computation time for a single pass of a relatively small CNN, for example, might be so short that more time is spent refeeding the GPU between batches than on the actual computation.

Update: After you added more information about your processing pipeline, I would say that your main bottleneck is the loading and decoding of the PNG images. PNG decompression (and compression even more so) is usually very expensive (according to this source, about 5 times more than JPEG). To check this assumption, you could profile your processing pipeline by measuring how much time every processing step (decoding, resizing, cropping, etc.) needs and which one is the main contributor.

Now there are many ways to optimize your processing pipeline:

  • It seems like you load plain, unprocessed PNG images of varying sizes. You could at least resize each image file to its final size up front. That would save storage and should decrease the loading/decoding overhead.
  • Use JPEG instead. The quality difference between JPEG and PNG is minimal for "real world" images, but JPEG takes less space and decoding is cheaper.
  • If you have enough storage available, you can save whole batches of images as compressed NumPy arrays in their final format. This may use more space but decreases the loading time drastically.
1 vote

You are correct about the processing chain.

What can result in a great performance increase, in my experience, is parallelizing the data loading (if it comes from a remote database, for example) as well as the data preprocessing.

This way you can keep preparing the data for the next batch while the current one trains, and ideally the processed data for the next batch is ready as soon as the last training step finishes on the GPU.

If your preprocessing is very heavy compared to a very fast training step, this might not increase performance by much, though. Then I would say your best bet is to move the preprocessing to the GPU as well, e.g. by using CUDA.

EDIT: Should that not help, I would suggest more in-depth profiling. If it really is some processing step, think about how to speed it up, or check whether it is a simple issue such as Python lists being used instead of NumPy arrays for array manipulation. In the end, your only remaining option would be to save the pre-processed data instead of computing it at runtime. An alternative is to cache it after the first processing pass (depending on how much RAM you have).
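
If you go the tf.data route, caching after the deterministic preprocessing is a one-liner. A sketch, reusing the preprocess_fn from the question (buffer sizes are arbitrary, and this assumes your TF version already has Dataset.cache):

dataset = dataset.map(preprocess_fn, num_parallel_calls=4)
dataset = dataset.cache()                      # in-memory; pass a filename to spill to disk instead
dataset = dataset.shuffle(buffer_size=1000).batch(3).repeat()
dataset = dataset.prefetch(1)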