I am having trouble understanding the implementation of batch normalization in TensorFlow. To illustrate, I have created a simple network with one input node, one hidden node, and one output node, and I run it for a single batch with a batch size of 2. My input x consists of two scalar values, one set to 0 and the other set to 1.
I run for one epoch and write out the output of the hidden layer (before and after batch normalization) as well as the batch norm moving mean, moving variance, gamma, and beta.
Here is my code:
import tensorflow as tf
import numpy as np
N_HIDDEN_1 = 1
N_INPUT = 1
N_OUTPUT = 1
###########################################################
# DEFINE THE Network
# Define placeholders for data that will be fed in during execution
x = tf.placeholder(tf.float32, (None, N_INPUT))
y = tf.placeholder(tf.float32, (None, N_OUTPUT))
lx = tf.placeholder(tf.float32, [])
training = tf.placeholder_with_default(False, shape=(), name='training')
# Hidden layers with relu activation
with tf.variable_scope('hidden1'):
    hidden_1 = tf.layers.dense(x, N_HIDDEN_1, activation=None, use_bias=False)
    bn_1 = tf.layers.batch_normalization(hidden_1, training=training, momentum=0.5)
    bn_1x = tf.nn.relu(bn_1)
# Output layer
with tf.variable_scope('output'):
    predx = tf.layers.dense(bn_1x, N_OUTPUT, activation=None, use_bias=False)
    pred = tf.layers.batch_normalization(predx, training=training, momentum=0.5)
###########################################################
# Define the cost function that is optimized when
# training the network and the optimizer
cost = tf.reduce_mean(tf.square(pred-y))
optimizer = tf.train.AdamOptimizer(learning_rate=lx).minimize(cost)
bout1 = tf.global_variables('hidden1/batch_normalization/moving_mean:0')
bout2 = tf.global_variables('hidden1/batch_normalization/moving_variance:0')
bout3 = tf.global_variables('hidden1/batch_normalization/gamma:0')
bout4 = tf.global_variables('hidden1/batch_normalization/beta:0')
###########################################################
# Train network
init = tf.global_variables_initializer()
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.Session() as sess:
    sess.run(init)
    # Create dummy data
    batchx = np.zeros((2, 1))
    batchy = np.zeros((2, 1))
    batchx[0, 0] = 0.0
    batchx[1, 0] = 1.0
    batchy[0, 0] = 3.0
    batchy[1, 0] = 4.0
    _, _ = sess.run([optimizer, extra_update_ops], feed_dict={training: True, x: batchx, y: batchy, lx: 0.001})
    print('weight of hidden layer')
    W1 = np.array(sess.run(tf.global_variables('hidden1/dense/kernel:0')))
    W1x = np.sum(W1, axis=1)
    print(W1x)
    print()
    print('output from hidden layer, batch norm layer, and relu layer')
    hid1, b1, b1x = sess.run([hidden_1, bn_1, bn_1x], feed_dict={training: False, x: batchx})
    print('hidden_1', hid1)
    print('bn_1', b1)
    print('bn_1x', b1x)
    print()
    print('batchnorm parameters')
    print('moving mean', sess.run(bout1))
    print('moving variance', sess.run(bout2))
    print('gamma', sess.run(bout3))
    print('beta', sess.run(bout4))
Here is the output I get when I run the code:
weight of hidden layer
[[1.404974]]
output from hidden layer, batch norm layer, and relu layer
hidden_1 [[0. ]
[1.404974]]
bn_1 [[-0.40697935]
[ 1.215785 ]]
bn_1x [[0. ]
[1.215785]]
batchnorm parameters
moving mean [array([0.3514931], dtype=float32)]
moving variance [array([0.74709475], dtype=float32)]
gamma [array([0.999], dtype=float32)]
beta [array([-0.001], dtype=float32)]
I am puzzled by the resulting batch norm parameters. In this case, the outputs of the hidden layer before batch normalization are the scalars 0 and 1.404974, yet the reported moving mean is 0.3514931 (with momentum = 0.5). It is not clear to me why, after one iteration, the moving mean is not exactly the average of 0 and 1.404974, i.e. about 0.702487. I was under the impression that the momentum parameter would only kick in from the second batch on.
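For what it's worth, the numbers do roughly line up if I assume that an exponential-moving-average update, moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean, is applied already on the very first batch, starting from moving_mean = 0 and moving_variance = 1. Both the update rule and those initial values are my assumptions here, not something I have confirmed in the TensorFlow source:
import numpy as np

momentum = 0.5
moving_mean = 0.0                  # assumed initial value
moving_var = 1.0                   # assumed initial value
batch = np.array([0.0, 1.404974])  # hidden-layer outputs from my run

# Assumed exponential-moving-average update, applied on the first batch:
moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean()
moving_var = momentum * moving_var + (1 - momentum) * batch.var()
print(moving_mean)  # 0.351243... close to the 0.3514931 reported above
print(moving_var)   # 0.746744... close to the 0.74709475 reported above
If that is indeed how tf.layers.batch_normalization updates its moving statistics, it would explain the values I see, but it contradicts my impression about the momentum parameter, so I would like to confirm.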
Any help would be much appreciated.