Dropout vs BatchNormalization - Standard deviation issue
There is a big problem that appears when you mix these layers, especially when `BatchNormalization` is placed right after `Dropout`.
Dropout tries to keep the same mean of the outputs as without dropout, but it does change the standard deviation, which will cause a huge difference in the `BatchNormalization` between training and validation. (During training, the `BatchNormalization` receives the changed standard deviations, accumulates them and stores them. During validation, the dropouts are turned off, so the standard deviation is not a changed one anymore, but the original one. `BatchNormalization`, because it's in validation, will not use the batch statistics but the stored statistics, which will be very different from the batch statistics.)
So, the first and most important rule is: don't place a `BatchNormalization` after a `Dropout` (or a `SpatialDropout`).
Usually, I try to leave at least two convolutional/dense layers without any dropout before applying a batch normalization, to avoid this.
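A minimal sketch of the effect described above, assuming TensorFlow/Keras (the shapes and the 0.5 rate are illustrative, not from the answer): during training, `Dropout` scales the surviving units by 1/(1 - rate), which preserves the mean but inflates the standard deviation; at inference it is an identity, so a `BatchNormalization` that stored the inflated statistics would see a mismatch.

```python
import tensorflow as tf

# Illustrative values only: rate 0.5, standard-normal activations.
x = tf.random.normal((10000, 64))
dropout = tf.keras.layers.Dropout(0.5)

train_out = dropout(x, training=True)   # units zeroed, survivors scaled by 2
infer_out = dropout(x, training=False)  # identity: original statistics

print(float(tf.math.reduce_std(train_out)))  # ~1.41 (inflated std)
print(float(tf.math.reduce_std(infer_out)))  # ~1.0  (original std)
```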
Dropout vs BatchNormalization - Changing the zeros to another value
Also important: the role of the `Dropout` is to "zero" the influence of some of the weights of the next layer. If you apply a normalization after the dropout, you will not have "zeros" anymore, but a certain value that will be repeated for many units. And this value will vary from batch to batch. So, although there is noise added, you are not killing units as a pure dropout is supposed to do.
Dropout vs MaxPooling
The problem of using a regular `Dropout` before a `MaxPooling` is that you will zero some pixels, and then the `MaxPooling` will take the maximum value, sort of ignoring part of your dropout. If your dropout happens to hit a maximum pixel, the pooling will result in the second maximum, not in zero.
So, `Dropout` before `MaxPooling` reduces the effectiveness of the dropout.
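A toy example of the effect, with made-up numbers for a single 2x2 pooling window:

```python
import numpy as np

window = np.array([1.0, 5.0, 3.0, 2.0])   # one 2x2 pooling window, flattened

dropped = window.copy()
dropped[1] = 0.0                          # dropout happened to hit the maximum (5.0)

print(window.max())    # 5.0 -> what pooling sees without dropout
print(dropped.max())   # 3.0 -> pooling returns the runner-up, not zero
```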
SpatialDropout vs MaxPooling
But a `SpatialDropout` never hits "pixels", it only hits channels. When it hits a channel, it will zero all pixels of that channel, and thus the `MaxPooling` will effectively result in zero too.
So, there is no difference between spatial dropout before or after the pooling. An entire "channel" will be zero in both orders.
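A quick check of this, assuming TensorFlow/Keras with channels-last tensors (the shapes and the 0.5 rate are illustrative): channels zeroed by `SpatialDropout2D` stay zero after `MaxPooling2D`.

```python
import tensorflow as tf

x = tf.random.uniform((1, 4, 4, 8)) + 0.1     # strictly positive "pixels"

dropped = tf.keras.layers.SpatialDropout2D(0.5)(x, training=True)
pooled = tf.keras.layers.MaxPooling2D(2)(dropped)

# Per-channel maxima: dropped channels are 0.0 both before and after pooling.
print(tf.reduce_max(dropped, axis=[1, 2]).numpy())
print(tf.reduce_max(pooled, axis=[1, 2]).numpy())
```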
BatchNormalization vs Activation
Depending on the activation function, using a batch normalization before it can be a real advantage.

For a `'relu'` activation, the normalization makes the model fail-safe against the bad-luck case of "all zeros freezing a relu layer". It will also tend to guarantee that half of the units will be zero and the other half linear.
For a `'sigmoid'` or a `'tanh'`, the `BatchNormalization` will guarantee that the values are within a healthy range, avoiding saturation and vanishing gradients (values that are too far from zero will hit an almost flat region of these functions, causing vanishing gradients).
Some people say there are other advantages if you do the contrary; I'm not fully aware of these advantages, but I like the ones I mentioned very much.
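One way to express "normalization before the activation" in Keras (a sketch; the layer sizes and input shape are arbitrary): keep the convolution linear and add the activation as a separate layer after the `BatchNormalization`.

```python
import tensorflow as tf
from tensorflow.keras import layers

block = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, padding="same"),   # no activation here
    layers.BatchNormalization(),
    layers.Activation("relu"),              # or "sigmoid" / "tanh"
])
```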
Dropout vs Activation
With `'relu'`, there is no difference; it can be proved that the results are exactly the same (relu maps zero to zero and commutes with the positive scaling that dropout applies).
With activations that are not zero-centered, such as `'sigmoid'`, putting a dropout before the activation will not result in "zeros", but in other values. For a sigmoid, the units that dropout zeroed come out as 0.5, since sigmoid(0) = 0.5.
If you add a `'tanh'` after a dropout, for instance, you will still have the zeros, but the scaling that dropout applies to keep the same mean will be distorted by the tanh. (I don't know if this is a big problem, but it might be.)
MaxPooling vs Activation
I don't see much here. If the activation is not very weird (i.e., it is monotonically increasing, as the usual ones are), the final result is the same in either order, because the maximum of the activated values equals the activation of the maximum value.
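A quick sanity check with relu, assuming TensorFlow/Keras (shapes are arbitrary):

```python
import tensorflow as tf

x = tf.random.normal((1, 4, 4, 3))
pool = tf.keras.layers.MaxPooling2D(2)

a = tf.nn.relu(pool(x))    # activation after pooling
b = pool(tf.nn.relu(x))    # activation before pooling
print(bool(tf.reduce_all(a == b)))   # True
```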
Conclusions?
There are many possibilities, but some are troublesome. I find the following order a good one and often use it; I would do something like this (a rough Keras sketch follows the list):
- Group1
    - Conv
    - BatchNorm
    - Activation
    - MaxPooling
    - Dropout or SpatialDropout
- Group2
    - Conv
    - ----- (there was a dropout in the last group, so no BatchNorm here)
    - Activation
    - MaxPooling
    - Dropout or SpatialDropout (decide whether to use it or not)
- After two groups without dropout, a BatchNorm can be used again
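Here is that ordering as a Keras sketch (the filter counts, dropout rates, input shape and final head are assumptions of mine, not part of the answer):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),

    # Group1
    layers.Conv2D(32, 3, padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D(2),
    layers.SpatialDropout2D(0.2),            # or layers.Dropout(0.2)

    # Group2 - no BatchNorm, because the previous group ended in a dropout
    layers.Conv2D(64, 3, padding="same"),
    layers.Activation("relu"),
    layers.MaxPooling2D(2),
    layers.SpatialDropout2D(0.2),            # optional (see the last bullet above)

    layers.Flatten(),
    layers.Dense(10, activation="softmax"),  # example head
])
model.summary()
```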