I am trying to understand how the dimensions in a convolutional neural network behave. In the figure below, the input is a 28-by-28 matrix with 1 channel. Then there are 32 5-by-5 filters (with stride 2 in height and width). So I understand that the result is 14-by-14-by-32. But then in the next convolutional layer we have 64 5-by-5 filters (again with stride 2). So why is the result 7-by-7-by-64 and not 7-by-7-by-32*64? Aren't we applying each one of the 64 filters to each one of the 32 channels?
3 Answers
Each filter sums over all the channels (the depth dimension) of the previous layer. This means that a 5x5 filter really spans all 32 channels, and each output value is a weighted sum of 32*5*5 input values. The weights are shared across spatial positions, but each of the 32 channels has its own 5x5 slice of weights. There are 64 such filters, so the output has depth 64. A better explanation with images can be found here: http://cs231n.github.io/convolutional-networks/.
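To make the weighted sum concrete, here is a minimal NumPy sketch of applying filters at a single spatial position. The random values and the names `patch`, `weights`, `bias` and `filters` are only illustrative assumptions, not anything taken from the network in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5x5 patch of the layer-1 output, with all 32 channels.
patch = rng.standard_normal((5, 5, 32))

# One layer-2 filter spans the full depth: 5 * 5 * 32 = 800 weights plus a bias.
weights = rng.standard_normal((5, 5, 32))
bias = 0.1  # illustrative value

# Applying the filter at this position collapses all 32 channels into one number.
activation = np.sum(patch * weights) + bias

# 64 such filters give 64 numbers at this position, not 32 * 64.
filters = rng.standard_normal((64, 5, 5, 32))
outputs = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))

print(np.ndim(activation), outputs.shape)
```

Sliding the same 64 filters over every spatial position is what produces the 7x7x64 output.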
The depth is usually given implicitly. For example, many images are considered to have depth 3 (for the three color channels in each pixel). Then by a 5x5 filter we mean a 5x5x3 filter. In your case the 5x5 filter is really a 5x5x32 filter.
Depth one is usually explicitly stated (as in "5x5x1 filter").
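One way to see the implicit depth is to count the weights. The helper below is a hypothetical function written for this illustration (it is not part of any library), and it ignores biases.

```python
def conv_weight_count(kernel_h, kernel_w, in_depth, num_filters):
    """Number of weights (excluding biases) in a convolutional layer."""
    return kernel_h * kernel_w * in_depth * num_filters


# Layer 1: 32 filters, each really 5x5x1.
layer1 = conv_weight_count(5, 5, 1, 32)

# Layer 2: 64 filters, each really 5x5x32.
layer2 = conv_weight_count(5, 5, 32, 64)

print(layer1, layer2)
```

The second layer has 32 times as many weights per filter as a plain "5x5" would suggest, because each filter carries one 5x5 slice per input channel.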
Here is a clear explanation of how the sizes of the inputs change from layer to layer.
The input you describe is 28 wide and 28 high, with depth 1. For the filters in layer 1, the depth of each filter must equal the depth of the input, so each filter is 5x5x1. Applying one filter reduces the spatial size (because of the stride of 2) and produces a 14x14x1 activation map, so applying 32 such filters gives you 32 activation maps. Stacked together, these form the 14x14x32 output of layer 1, which is the input to your second layer. In the second layer you again need filters whose depth matches the input: 5 (width) x 5 (height) x 32 (depth). Each such filter produces one 7x7x1 activation map, and stacking all 64 activation maps gives the output of the second layer, 7x7x64, and so on.
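Yes, you are actually applying each of the 64 filters to all 32 channels, but each filter sums its results over the 32 channels into a single activation map, so the output depth is 64 rather than 32*64.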
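The sizes can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below uses the 5x5 filters and stride 2 from the question. The padding of 2 is an assumption on my part, chosen because it reproduces the 28 -> 14 -> 7 sizes in the question, and `conv_output_shape` is a hypothetical helper, not a library function.

```python
def conv_output_shape(in_shape, kernel, stride, padding, num_filters):
    """Output (height, width, depth) of a conv layer with square kernels."""
    h, w, _depth = in_shape  # input depth is absorbed by each filter
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    # The output depth depends only on the number of filters.
    return (out_h, out_w, num_filters)


shape0 = (28, 28, 1)
# padding=2 is assumed here; the question does not state it.
shape1 = conv_output_shape(shape0, kernel=5, stride=2, padding=2, num_filters=32)
shape2 = conv_output_shape(shape1, kernel=5, stride=2, padding=2, num_filters=64)

print(shape1, shape2)
```

Note that the input depth never appears in the returned shape: it only determines how deep each filter has to be.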