Multi-GPU training using tf.slim takes more time than single GPU

Question

I'm fine-tuning ResNet50 on the CIFAR10 dataset using tf.slim's train_image_classifier.py script:

python train_image_classifier.py \
  --train_dir=${TRAIN_DIR}/all \
  --dataset_name=cifar10 \
  --dataset_split_name=train \
  --dataset_dir=${DATASET_DIR} \
  --checkpoint_path=${TRAIN_DIR} \
  --model_name=resnet_v1_50 \
  --max_number_of_steps=3000 \
  --batch_size=32 \
  --num_clones=4 \
  --learning_rate=0.0001 \
  --save_interval_secs=10 \
  --save_summaries_secs=10 \
  --log_every_n_steps=10 \
  --optimizer=sgd

For 3k steps, running this on a single GPU (Tesla M40) takes around 30 minutes, while running on 4 GPUs takes 50+ minutes. (The accuracy is similar in both cases: ~75% and ~78%.)

I know that one possible cause of delay in multi-GPU setups is loading the images, but in the case of tf.slim, it uses the CPU for that. Any ideas of what could be the issue? Thank you!

Timeline would help identify the performance bottleneck. Usage of timeline: stackoverflow.com/questions/36123740/… — Yao Zhang
@YaoZhang I've kept track of the GPU usage through nvidia-smi: there are bursts where all 4 GPUs are at 90+% utilization, followed by moments at 0%, and this pattern repeats throughout training. — Anas

bottlerun · Accepted Answer · 2017-12-13T10:15:44

Setting num_clones to use multiple GPUs will not make each step faster. slim trains on batch_size * num_clones examples per step, with one batch of batch_size examples on each GPU. It then divides each clone's loss by num_clones and sums the results to get the total loss. (https://github.com/tensorflow/models/blob/master/research/slim/deployment/model_deploy.py)
When the CPU becomes the bottleneck, the input pipeline cannot produce data fast enough to keep the GPUs fed, so a step can take up to 4 times longer with num_clones=4. (https://www.tensorflow.org/performance/performance_guide)
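To illustrate the loss scaling described here, below is a minimal pure-Python sketch. It is not slim's actual code: the function name `scaled_clone_losses` and the sample loss values are hypothetical, and it assumes each clone averages the loss over its own equally sized batch. Under those assumptions, dividing each clone's loss by num_clones and summing gives the same value as the mean loss over all batch_size * num_clones examples.

```python
def scaled_clone_losses(per_example_losses, num_clones):
    """Hypothetical helper: split losses evenly across clones and
    return each clone's mean loss divided by num_clones."""
    per_clone = len(per_example_losses) // num_clones
    scaled = []
    for i in range(num_clones):
        chunk = per_example_losses[i * per_clone:(i + 1) * per_clone]
        clone_mean = sum(chunk) / len(chunk)    # loss computed on one GPU
        scaled.append(clone_mean / num_clones)  # scaled by 1 / num_clones
    return scaled


# Made-up per-example losses: 4 clones with 2 examples each.
losses = [0.5, 1.5, 2.0, 4.0, 1.0, 3.0, 0.0, 4.0]

total_loss = sum(scaled_clone_losses(losses, num_clones=4))
global_mean = sum(losses) / len(losses)

print(total_loss, global_mean)
```

So 4 clones behave like a single step over a batch 4 times as large, not like a faster step over the same batch.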
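The effect on the timings in the question can be checked with some quick arithmetic. This sketch assumes that --batch_size applies per clone (as described above) and takes the 4-GPU run to be exactly 50 minutes, although the question says "50+"; the helper name `images_per_second` is made up for illustration.

```python
def images_per_second(steps, batch_size, num_clones, minutes):
    """Hypothetical helper: examples processed per second of wall-clock time."""
    images = steps * batch_size * num_clones
    return images / (minutes * 60.0)


# Figures from the question: 3000 steps, batch_size=32.
single_gpu = images_per_second(steps=3000, batch_size=32, num_clones=1, minutes=30)
four_gpus = images_per_second(steps=3000, batch_size=32, num_clones=4, minutes=50)

print("1 GPU : %.1f images/s" % single_gpu)
print("4 GPUs: %.1f images/s" % four_gpus)
print("ratio : %.2f" % (four_gpus / single_gpu))
```

Under these assumptions the 4-GPU run processes about 2.4 times as many images per second, even though it takes longer to complete the same number of steps. For a like-for-like comparison, compare runs that see the same number of examples rather than the same number of steps.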
