I am having trouble understanding the implementation of batch normalization in TensorFlow. To illustrate, I have created a simple network with one input node, one hidden node, and one output node, and I run it for a single batch with a batch size of 2. My input x consists of two scalar values, one set to 0 and the other set to 1.
I run for one epoch and write out the output of the hidden layer (before and after batch normalization) as well as the batch norm moving mean, moving variance, gamma, and beta.
Here is my code:
import tensorflow as tf
import numpy as np
N_HIDDEN_1 = 1
N_INPUT = 1
N_OUTPUT = 1
###########################################################
# DEFINE THE Network
# Define placeholders for data that will be fed in during execution
x = tf.placeholder(tf.float32, (None, N_INPUT))
y = tf.placeholder(tf.float32, (None, N_OUTPUT))
lx = tf.placeholder(tf.float32, [])
training = tf.placeholder_with_default(False, shape=(), name='training')
# Hidden layers with relu activation
with tf.variable_scope('hidden1'):
    hidden_1 = tf.layers.dense(x, N_HIDDEN_1, activation=None, use_bias=False)
    bn_1 = tf.layers.batch_normalization(hidden_1, training=training, momentum=0.5)
    bn_1x = tf.nn.relu(bn_1)
# Output layer
with tf.variable_scope('output'):
    predx = tf.layers.dense(bn_1x, N_OUTPUT, activation=None, use_bias=False)
    pred = tf.layers.batch_normalization(predx, training=training, momentum=0.5)
###########################################################
# Define the cost function that is optimized when
# training the network and the optimizer
cost = tf.reduce_mean(tf.square(pred-y))
optimizer = tf.train.AdamOptimizer(learning_rate=lx).minimize(cost)
bout1 = tf.global_variables('hidden1/batch_normalization/moving_mean:0')
bout2 = tf.global_variables('hidden1/batch_normalization/moving_variance:0')
bout3 = tf.global_variables('hidden1/batch_normalization/gamma:0')
bout4 = tf.global_variables('hidden1/batch_normalization/beta:0')
###########################################################
# Train network
init = tf.global_variables_initializer()
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.Session() as sess:
    sess.run(init)
    # Create dummy data
    batchx = np.zeros((2, 1))
    batchy = np.zeros((2, 1))
    batchx[0, 0] = 0.0
    batchx[1, 0] = 1.0
    batchy[0, 0] = 3.0
    batchy[1, 0] = 4.0
    _, _ = sess.run([optimizer, extra_update_ops], feed_dict={training: True, x: batchx, y: batchy, lx: 0.001})
    print('weight of hidden layer')
    W1 = np.array(sess.run(tf.global_variables('hidden1/dense/kernel:0')))
    W1x = np.sum(W1, axis=1)
    print(W1x)
    print()
    print('output from hidden layer, batch norm layer, and relu layer')
    hid1, b1, b1x = sess.run([hidden_1, bn_1, bn_1x], feed_dict={training: False, x: batchx})
    print('hidden_1', hid1)
    print('bn_1', b1)
    print('bn_1x', b1x)
    print()
    print('batchnorm parameters')
    print('moving mean', sess.run(bout1))
    print('moving variance', sess.run(bout2))
    print('gamma', sess.run(bout3))
    print('beta', sess.run(bout4))
Here is the output I get when I run the code:
weight of hidden layer
[[1.404974]]
output from hidden layer, batch norm layer, and relu layer
hidden_1 [[0. ]
[1.404974]]
bn_1 [[-0.40697935]
[ 1.215785 ]]
bn_1x [[0. ]
[1.215785]]
batchnorm parameters
moving mean [array([0.3514931], dtype=float32)]
moving variance [array([0.74709475], dtype=float32)]
gamma [array([0.999], dtype=float32)]
beta [array([-0.001], dtype=float32)]
I am puzzled by the resulting batch norm parameters. In this case, the outputs of the hidden layer before batch normalization are the scalars 0 and 1.404974, yet the reported moving mean is 0.3514931 (with momentum = 0.5). It is not clear to me why, after one iteration, the moving mean is not exactly the average of 0 and 1.404974, i.e. about 0.702487. I was under the impression that the momentum parameter would only kick in from the second batch on.
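For what it's worth, the numbers do roughly line up if I assume that an exponential-moving-average update, moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean, is applied already on the very first batch, starting from moving_mean = 0 and moving_variance = 1. Both the update rule and those initial values are my assumptions here, not something I have confirmed in the TensorFlow source:
import numpy as np

momentum = 0.5
moving_mean = 0.0                  # assumed initial value
moving_var = 1.0                   # assumed initial value
batch = np.array([0.0, 1.404974])  # hidden-layer outputs from my run

# Assumed exponential-moving-average update, applied on the first batch:
moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean()
moving_var = momentum * moving_var + (1 - momentum) * batch.var()
print(moving_mean)  # 0.351243... close to the 0.3514931 reported above
print(moving_var)   # 0.746744... close to the 0.74709475 reported above
If that is indeed how tf.layers.batch_normalization updates its moving statistics, it would explain the values I see, but it contradicts my impression about the momentum parameter, so I would like to confirm.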
Any help would be much appreciated.