My question is: what exactly is being normalized by BatchNormalization (BN)?
That is, does BN normalize the channels for each pixel separately, or across all the pixels together? And does it do this on a per-image basis, or across all the channels of the entire batch?
Specifically, BN operates on X. Say X.shape = [m, h, w, c]. With axis=3 it operates on the "c" dimension, which is the number of channels (e.g. 3 for RGB) or the number of feature maps.
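For concreteness, this is roughly the setup I have in mind (a tf.keras sketch; the shapes and values are just illustrative):

```python
import numpy as np
import tensorflow as tf

# Illustrative shapes only: a batch of m=4 RGB images, each 32x32 pixels.
m, h, w, c = 4, 32, 32, 3
X = np.random.rand(m, h, w, c).astype("float32")

# channels_last data, so the features axis is the last one (axis=3).
bn = tf.keras.layers.BatchNormalization(axis=3)
Y = bn(X, training=True)
print(Y.shape)  # (4, 32, 32, 3)
```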
So let's say X is an RGB image and thus has 3 channels. Does BN do the following? (This is a simplified version of BN, just to discuss the dimensional aspects. I understand that gamma and beta are learned, but I am not concerned with that here.)
For each image X in the batch of m images:
- For each pixel (h, w), take the mean of the associated r, g, and b values.
- For each pixel (h, w), take the variance of the associated r, g, and b values.
- Do r = (r - mean)/var, g = (g - mean)/var, and b = (b - mean)/var, where r, g, and b are the red, green, and blue channels of X respectively.
- Then repeat this process for the next image in the batch (a sketch of this procedure follows below).
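To make the steps above concrete, here is a minimal NumPy sketch of the procedure I am describing. The function name and the eps term are my own additions, and this is just my interpretation, not necessarily what BN actually does:

```python
import numpy as np

def per_pixel_channel_norm(X, eps=1e-5):
    # X has shape [m, h, w, c]; normalize each pixel across its c channel values,
    # independently for every image in the batch.
    mean = X.mean(axis=-1, keepdims=True)  # shape [m, h, w, 1]: one mean per pixel
    var = X.var(axis=-1, keepdims=True)    # shape [m, h, w, 1]: one variance per pixel
    return (X - mean) / (var + eps)        # the simplified formula from the list above

X = np.random.rand(4, 32, 32, 3)
print(per_pixel_channel_norm(X).shape)  # (4, 32, 32, 3)
```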
In Keras, the docs for BatchNormalization say:
axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format="channels_first", set axis=1 in BatchNormalization.
But what exactly is it doing along each dimension?
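In other words, these are the two readings of axis=3 that I am trying to distinguish (plain NumPy, shapes only for illustration):

```python
import numpy as np

X = np.random.rand(4, 32, 32, 3)  # [m, h, w, c]

# Reading A: statistics per image and per pixel, taken across the c channels.
mean_a = X.mean(axis=3, keepdims=True)          # shape (4, 32, 32, 1)

# Reading B: statistics per channel, taken across the batch and all pixels.
mean_b = X.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, 3)

print(mean_a.shape, mean_b.shape)
```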