I am increasing the batch size as I increase the number of GPUs while training AlexNet on the ImageNet dataset with ChainerMN. I start with a batch size of 1024 on 4 GPUs, then 2048 on 8 GPUs; both run fine. However, when I attempt a batch size of 4096 on 16 GPUs, I get OOM errors. Ideally this shouldn't happen, because with data parallelism the number of samples per GPU stays constant (256 in every configuration).
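
For reference, here is a minimal sketch of my setup, following the standard ChainerMN data-parallel pattern (`AlexNet()` and `train_dataset` are placeholders for my actual model and ImageNet loader, not my exact script). Since the dataset is scattered across processes, each iterator should only ever hold `global_batch // comm.size = 256` samples, which is why I expected per-GPU memory usage to stay flat:

```python
import chainer
import chainermn
from chainer import training

# One MPI process per GPU; gradients are all-reduced across processes.
comm = chainermn.create_communicator('pure_nccl')
device = comm.intra_rank  # local GPU id for this process

global_batch = 4096
local_batch = global_batch // comm.size  # 4096 // 16 = 256 per GPU

model = chainer.links.Classifier(AlexNet())  # placeholder model
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(lr=0.01), comm)
optimizer.setup(model)

# Each process sees only its shard of the dataset, so the iterator's
# batch size is the per-GPU batch size, not the global one.
train = chainermn.scatter_dataset(train_dataset, comm, shuffle=True)
train_iter = chainer.iterators.SerialIterator(train, local_batch)

updater = training.updaters.StandardUpdater(
    train_iter, optimizer, device=device)
trainer = training.Trainer(updater, (10, 'epoch'))
trainer.run()
```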