
I've been coding along with this example of a convolutional net in TensorFlow, and I'm mystified by this allocation of weights:

weights = {
    # 5x5 conv, 1 input, 32 outputs
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),

    # 5x5 conv, 32 inputs, 64 outputs
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),

    # fully connected, 7*7*64 inputs, 1024 outputs
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),

    # 1024 inputs, 10 outputs (class prediction)
    'out': tf.Variable(tf.random_normal([1024, n_classes]))
}


How do we know the 'wd1' weight matrix should have 7 x 7 x 64 rows?

It's later used to reshape the output of the second convolution layer:

# Fully connected layer
# Reshape conv2 output to fit dense layer input
dense1 = tf.reshape(conv2, [-1, _weights['wd1'].get_shape().as_list()[0]]) 

# Relu activation
dense1 = tf.nn.relu(tf.add(tf.matmul(dense1, _weights['wd1']), _biases['bd1']))

By my math, pooling layer 2 (conv2 output) has 4 x 4 x 64 neurons.

Why are we reshaping to [-1, 7*7*64]?


2 Answers


Working from the start:

The input, _X, has size [28x28x1] (ignoring the batch dimension): a 28x28 greyscale image.

The first convolutional layer uses padding='SAME', so it outputs a 28x28 layer. That output is then passed to a max_pool with k=2, which reduces each spatial dimension by a factor of two, resulting in a 14x14 spatial layout. conv1 has 32 outputs, so the full per-example tensor is now [14x14x32].

This is repeated in conv2, which has 64 outputs, resulting in a [7x7x64] tensor.

tl;dr: The image starts as 28x28, and each maxpool reduces it by a factor of two in each dimension. 28/2/2 = 7.
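
The shape arithmetic above can be traced in a few lines of plain Python. This is only a sketch: `conv_pool_block` is a hypothetical helper (not part of TensorFlow), and it assumes stride-1 convolutions with padding='SAME' followed by max pooling with k=2, as described above.

```python
def conv_pool_block(shape, out_channels, pool_k=2):
    """Return the (height, width, depth) shape after one conv + max-pool block.

    Assumes a stride-1 convolution with padding='SAME' (height and width are
    unchanged, depth becomes out_channels) followed by max pooling with
    k=pool_k. Integer division is exact here because 28 and 14 are even.
    """
    height, width, _ = shape
    return (height // pool_k, width // pool_k, out_channels)


input_shape = (28, 28, 1)                          # greyscale image, no batch dimension
after_block1 = conv_pool_block(input_shape, 32)    # conv1 + max pool
after_block2 = conv_pool_block(after_block1, 64)   # conv2 + max pool

# Number of values per example once conv2's output is flattened
flat_size = after_block2[0] * after_block2[1] * after_block2[2]

print(after_block1)  # (14, 14, 32)
print(after_block2)  # (7, 7, 64)
print(flat_size)     # 3136, i.e. 7*7*64
```

The final number is the row count of 'wd1', which is why the reshape uses [-1, 7*7*64].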


Answering this question requires a good understanding of how convolutions work in deep learning.

Basically, each convolution or pooling layer in your model can shrink the spatial area (height x width) of the feature maps. How much it shrinks depends on the convolution stride and the max pooling stride. To complicate things, there are two options for how the output size is computed, depending on the PADDING.

Option 1 - PADDING='SAME'

out_height = ceil(float(in_height) / float(strides[1]))
out_width  = ceil(float(in_width) / float(strides[2]))

Option 2 - PADDING='VALID'

out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))

For EACH convolution and max pooling call you have to calculate a new out_height and out_width. Then, after the last convolution and pooling step, you multiply out_height, out_width and the depth of your last convolution layer. The result of this multiplication is the flattened feature map size, which is the input size of your first fully connected layer.

So, in your example you probably had only PADDING='SAME', with a convolution stride of 1 followed by a max pooling stride of 2, twice. In the end the height and width are each divided by 4 (the strides are 1, 2, 1, 2, and 1*2*1*2 = 4), so 28 becomes 7.
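
As a rough check, the two formulas can be wrapped in a small pure-Python helper and applied layer by layer. The names `out_size`, `flattened_size` and `LAYERS` are made up for this sketch, and the layer list (5x5 convolutions with stride 1, 2x2 max pooling with stride 2) is an assumption based on the example in the question. Under 'VALID' padding the same network would end at 4 x 4 x 64, which may be where that figure in the question came from.

```python
import math


def out_size(in_size, filter_size, stride, padding):
    """Output length along one spatial dimension, per the formulas above."""
    if padding == "SAME":
        return math.ceil(in_size / stride)
    if padding == "VALID":
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError(f"unknown padding: {padding!r}")


def flattened_size(in_size, layers, depth, padding):
    """Apply out_size for every (filter_size, stride) pair in order.

    Returns the final spatial side length and side * side * depth,
    assuming square inputs and the same padding mode for every layer.
    """
    size = in_size
    for filter_size, stride in layers:
        size = out_size(size, filter_size, stride, padding)
    return size, size * size * depth


# Assumed layer sequence: conv 5x5 stride 1, pool 2x2 stride 2, twice.
LAYERS = [(5, 1), (2, 2), (5, 1), (2, 2)]

same_side, same_flat = flattened_size(28, LAYERS, 64, "SAME")
valid_side, valid_flat = flattened_size(28, LAYERS, 64, "VALID")

print(same_side, same_flat)    # 7 3136  -> 7 x 7 x 64
print(valid_side, valid_flat)  # 4 1024  -> 4 x 4 x 64
```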

More info is available in the TensorFlow API documentation.