I am playing with the BatchNormalization layer, and I can't quite figure out the numerical results I get.
Let's say we use BatchNormalization for computer vision, so we work with 4D tensors whose dimensions are (batch size, image height, image width, channels).
If I understand correctly, what BatchNormalization will do is:
- At training time:
- for each batch, compute the mean MU and the variance SIGMA. This is done per channel, across all rows and all columns of all images in the batch.
- keep an exponential moving average of MU (say MÛ) and of SIGMA (say SIĜMA) across all batches
- use MÛ and SIĜMA to normalize pixels: normalized_pixel = ((input_pixel - MÛ) / sqrt(SIĜMA))
- a hyper-parameter epsilon is added to SIĜMA to prevent division by zero if SIĜMA becomes zero at some point during training: normalized_pixel = ((input_pixel - MÛ) / sqrt(SIĜMA + epsilon))
- use a scale parameter GAMMA and an offset parameter BETA to re-scale the normalized pixel: output_pixel = ((GAMMA x normalized_pixel) + BETA)
- GAMMA and BETA are trainable parameters; they are optimized during training
- At inference time:
- MÛ and SIĜMA are now fixed parameters, just like GAMMA and BETA
- Same computations apply (see the numpy sketch just after this list)
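To be explicit about what I mean, here is a minimal numpy sketch of that inference-time formula as I understand it (the function and variable names are mine, not Keras internals, and I assume a channels-last layout):
import numpy

def batchnorm_inference(x, moving_mean, moving_variance, gamma, beta, epsilon):
    # x has shape (batch, height, width, channels); the four parameters each have
    # shape (channels,) and broadcast over the last axis
    normalized = (x - moving_mean) / numpy.sqrt(moving_variance + epsilon)
    return (gamma * normalized) + beta

# At training time, the per-batch statistics would be reduced per channel,
# i.e. over the batch, height and width axes:
# batch_mean = x.mean(axis=(0, 1, 2))
# batch_variance = x.var(axis=(0, 1, 2))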
Now, here comes my question...
First, I am only interested in what happens at inference time. I don't care about training, and I consider MÛ, SIĜMA, GAMMA and BETA to be fixed parameters.
I wrote a piece of Python code to test BatchNormalization on a (1, 3, 4, 1) tensor. Since there is only one channel, MÛ, SIĜMA, GAMMA and BETA each have a single element.
I chose MÛ = 0.0, SIĜMA = 1.0, GAMMA = 1.0 and BETA = 0.0, so that BatchNormalization should have no effect.
Here is the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy
import keras
import math
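# a single 3x4 image with one channel: shape (batch, height, width, channels) = (1, 3, 4, 1)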
input_batch = numpy.array(
[[
[[ 1.0], [ 2.0], [ 3.0], [ 4.0]],
[[ 5.0], [ 6.0], [ 7.0], [ 8.0]],
[[ 9.0], [10.0], [11.0], [12.0]]
]],
dtype=numpy.float32
)
MU = 0.0
SIGMA = 1.0
GAMMA = 1.0
BETA = 0.0
input_layer = keras.layers.Input(
shape = (
None,
None,
1
)
)
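# axis=-1: statistics and parameters are per channel (the last dimension)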
BatchNormalization_layer = keras.layers.BatchNormalization(
axis=-1,
#epsilon=0.0,
center=True,
scale=True
)(
input_layer
)
model = keras.models.Model(
inputs = [input_layer],
outputs = [BatchNormalization_layer]
)
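# weight order for BatchNormalization with scale=True and center=True:
# [gamma, beta, moving_mean, moving_variance], i.e. [GAMMA, BETA, MÛ, SIĜMA]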
model.layers[1].set_weights(
(
numpy.array([GAMMA], dtype=numpy.float32),
numpy.array([BETA], dtype=numpy.float32),
numpy.array([MU], dtype=numpy.float32),
numpy.array([SIGMA], dtype=numpy.float32),
)
)
print(model.predict(input_batch))
print((((input_batch - MU) / math.sqrt(SIGMA)) * GAMMA) + BETA)
When I write the computation explicitly with numpy, ((((input_batch - MU) / math.sqrt(SIGMA)) * GAMMA) + BETA), I get the expected result.
However, when I use the keras.layers.BatchNormalization layer to perform the computation, I get similar values, but with what looks like some kind of rounding error or imprecision:
Using TensorFlow backend.
[[[[ 0.9995004]
[ 1.9990008]
[ 2.9985013]
[ 3.9980016]]
[[ 4.997502 ]
[ 5.9970026]
[ 6.996503 ]
[ 7.996003 ]]
[[ 8.995503 ]
[ 9.995004 ]
[10.994504 ]
[11.994005 ]]]]
[[[[ 1.]
[ 2.]
[ 3.]
[ 4.]]
[[ 5.]
[ 6.]
[ 7.]
[ 8.]]
[[ 9.]
[10.]
[11.]
[12.]]]]
When I play with the values of MÛ, SIĜMA, GAMMA and BETA, the output is affected as expected, so I believe I am providing the parameters to the layer correctly.
I also tried setting the layer's hyper-parameter epsilon to 0.0. It changes the results a little bit, but does not solve the issue.
Using TensorFlow backend.
[[[[ 0.999995 ]
[ 1.99999 ]
[ 2.999985 ]
[ 3.99998 ]]
[[ 4.999975 ]
[ 5.99997 ]
[ 6.9999647]
[ 7.99996 ]]
[[ 8.999955 ]
[ 9.99995 ]
[10.999945 ]
[11.99994 ]]]]
[[[[ 1.]
[ 2.]
[ 3.]
[ 4.]]
[[ 5.]
[ 6.]
[ 7.]
[ 8.]]
[[ 9.]
[10.]
[11.]
[12.]]]]
Can someone explain what is going on?
Thanks,
Julien