0
votes

I'm fine-tuning ResNet50 on the CIFAR10 dataset using tf.slim's train_image_classifier.py script:

python train_image_classifier.py \
  --train_dir=${TRAIN_DIR}/all \
  --dataset_name=cifar10 \
  --dataset_split_name=train \
  --dataset_dir=${DATASET_DIR} \
  --checkpoint_path=${TRAIN_DIR} \
  --model_name=resnet_v1_50 \
  --max_number_of_steps=3000 \
  --batch_size=32 \
  --num_clones=4 \
  --learning_rate=0.0001 \
  --save_interval_secs=10 \
  --save_summaries_secs=10 \
  --log_every_n_steps=10 \
  --optimizer=sgd

For 3k steps, running this on a single GPU (Tesla M40) takes around 30 min, while running on 4 GPUs takes 50+ min. (The accuracy is similar in both cases: ~75% and ~78%, respectively.)

I know that one possible cause of delay in multi-GPU setups is loading the images, but in the case of tf.slim, it uses the CPU for that. Any ideas of what could be the issue? Thank you!

1
Timeline would help identify the performance bottleneck. Usage of timeline: stackoverflow.com/questions/36123740/… - Yao Zhang
@YaoZhang I've kept track of the GPU usage through nvidia-smi: there are bursts where all 4 GPUs are at 90+% utilization, followed by moments at 0%, and this pattern repeats throughout training. - Anas
This is better answered if you file an issue on GitHub - keveman

1 Answer

1
votes
  1. Setting num_clones to use multiple GPUs does not make each step faster. slim gives every clone (GPU) its own batch of batch_size examples, so each step processes batch_size * num_clones examples in total. Each clone's loss is then divided by num_clones, and the results are summed to get the total loss. With the same max_number_of_steps, the 4-GPU run therefore trains on 4 times as much data. (https://github.com/tensorflow/models/blob/master/research/slim/deployment/model_deploy.py)
  2. When the CPU becomes the bottleneck, the input pipeline cannot produce data fast enough to keep every GPU fed. In the worst case, a step with num_clones=4 can then take up to 4 times as long as a single-GPU step. (https://www.tensorflow.org/performance/performance_guide)
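
To make the two points above concrete, here is a rough back-of-the-envelope sketch in plain Python (no TensorFlow needed). It uses the numbers from the question and assumes the 4-GPU run took roughly 50 minutes, since the question only says "50+". The function names are made up for illustration and are not part of slim.

```python
# Back-of-the-envelope comparison; helper names are hypothetical, not slim APIs.

def samples_processed(steps, batch_size, num_clones):
    """Examples consumed, assuming each clone gets its own batch of batch_size."""
    return steps * batch_size * num_clones


def throughput(steps, batch_size, num_clones, minutes):
    """Examples processed per minute of wall-clock time."""
    return samples_processed(steps, batch_size, num_clones) / minutes


def combined_loss(clone_losses):
    """Scale each clone's loss by 1/num_clones, then sum (i.e. the mean)."""
    num_clones = len(clone_losses)
    return sum(loss / num_clones for loss in clone_losses)


def equivalent_steps(single_gpu_steps, num_clones):
    """Steps needed to see the same number of examples as the single-GPU run."""
    return single_gpu_steps // num_clones


single = throughput(steps=3000, batch_size=32, num_clones=1, minutes=30)
multi = throughput(steps=3000, batch_size=32, num_clones=4, minutes=50)  # assumed ~50 min

print("single GPU:", single, "examples/min")
print("4 GPUs:    ", multi, "examples/min")
print("speedup:   ", multi / single)
print("steps for equal data on 4 GPUs:", equivalent_steps(3000, 4))
```

Under these assumptions the 4-GPU run processes more examples per minute than the single-GPU run, but less than the ideal 4x, which would be consistent with the input pipeline stalling the GPUs. For a like-for-like wall-clock comparison, one option is to lower --max_number_of_steps so both runs see the same amount of data.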