I am training a model for semantic segmentation. On a single GPU I train with a batch size of 10 images. In parallel, I am running the same hyper-parameters on a multi-GPU setup (3 GPUs) with a batch size of 30 images, i.e., 10 images per GPU.
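For concreteness, here is a minimal sketch of the two configurations I am comparing. This assumes PyTorch with `nn.DataParallel`; the model here is just a toy placeholder for my real segmentation network:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real segmentation network.
model = nn.Conv2d(3, 21, kernel_size=1)

# Single-GPU run: one device, batch of 10 images per step.
single_gpu = model.cuda(0)
batch_single = torch.randn(10, 3, 256, 256, device='cuda:0')

# Multi-GPU run: same hyper-parameters, batch of 30 images per step,
# which DataParallel splits into 10 images per GPU.
multi_gpu = nn.DataParallel(model, device_ids=[0, 1, 2]).cuda(0)
batch_multi = torch.randn(30, 3, 256, 256, device='cuda:0')
```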
Theoretically, should the per-step loss values in each epoch fall in the same range for both the single-GPU and multi-GPU training runs?
That is not what I am seeing: the loss from the multi-GPU run is about 5 times larger than the loss from the single-GPU run.
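One thing I have been wondering is whether the loss reduction explains the scale difference. A quick sanity check on random tensors (again assuming PyTorch; shapes are hypothetical):

```python
import torch
import torch.nn as nn

logits = torch.randn(30, 21, 64, 64)           # batch of 30, 21 classes
targets = torch.randint(0, 21, (30, 64, 64))   # per-pixel class labels

mean_loss = nn.CrossEntropyLoss(reduction='mean')(logits, targets)
sum_loss = nn.CrossEntropyLoss(reduction='sum')(logits, targets)

# With 'mean', the value is independent of batch size, so single- and
# multi-GPU runs should sit in the same range. With 'sum' (or with
# per-GPU losses summed after the gather), the value grows with the
# total batch size.
print(mean_loss.item(), sum_loss.item())
```

If the per-GPU losses were being summed rather than averaged, I would expect roughly a 3x difference with 3 GPUs, so the 5x I observe is puzzling either way.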
Any input/suggestion is welcome.