Introduction
From what I understood from CS231n: Convolutional Neural Networks for Visual Recognition, the size of the output volume determines the number of neurones, given the following parameters:
- Input volume size (W)
- The receptive field size of the Conv layer neurons (F), i.e. the size of the kernel or filter
- Stride with which they are applied (S), i.e. the step by which we move the kernel
- Amount of zero padding used (P) on the border
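These parameters combine into the output-size formula from the notes, (W - F + 2P)/S + 1. A minimal helper to evaluate it (my own sketch; the function name is made up):

```python
def conv_output_size(W, F, S, P):
    """Spatial size of the conv output: (W - F + 2*P) / S + 1."""
    size = (W - F + 2 * P) / S + 1
    # A non-integer result means the filter does not tile the input evenly.
    assert size.is_integer(), "hyperparameters do not fit the input"
    return int(size)

print(conv_output_size(227, 11, 4, 0))  # 55 (example 1 below)
print(conv_output_size(11, 5, 2, 0))    # 4  (example 2 below)
```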
I posted two examples below. Example 1 is clear to me, but example 2 confuses me.
Example 1
In the Real-world example section, they start with a [227 x 227 x 3] input image. The parameters are the following: F = 11, S = 4, P = 0, W = 227.
We note that the convolution has a depth of K = 96. (Why?)
The size of the output volume is (227 - 11)/4 + 1 = 55. So we will have 55 x 55 x 96 = 290,400 neurones, each connected (excuse me if I butchered the term) to an [11 x 11 x 3] region of the image, which is the patch over which the dot product with the kernel is computed.
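As a quick sanity check of the arithmetic above (my own verification, not from the notes):

```python
out = (227 - 11 + 2 * 0) // 4 + 1  # output width/height from (W - F + 2P)/S + 1
neurones = out * out * 96          # one neurone per (x, y, filter) position
print(out, neurones)               # 55 290400
```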
Example 2
In the following example, taken from the Numpy examples section, we have an input volume of shape [11 x 11 x 4]. The parameters used to compute the size of the output volume are the following: W = 11, P = 0, S = 2 and F = 5.
We note that the convolution has a depth of K = 4.
The formula (11 - 5)/2 + 1 = 4 produces only 4 neurones. Each neurone is connected to a region of size [5 x 5 x 4] in the input.
It seems that they are moving the kernel in the x direction only. Shouldn't we have 12 neurones, each having [5 x 5 x 4] weights?
# From the notes: activations of the first filter W0, moving with
# stride 2 along the first spatial axis only.
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
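For reference, here is my own sketch extending the snippet above over both spatial axes; X, the filters and the biases are random placeholders with the shapes used in the notes, not the notes' actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((11, 11, 4))         # input volume, depth 4
filters = rng.standard_normal((4, 5, 5, 4))  # K = 4 filters of size 5x5x4
b = rng.standard_normal(4)                   # one bias per filter

out = (11 - 5) // 2 + 1                      # 4 positions along each spatial axis
V = np.zeros((out, out, 4))
for i in range(out):                         # slide in y as well as x
    for j in range(out):
        for k in range(4):
            patch = X[2*i:2*i+5, 2*j:2*j+5, :]   # stride 2 in both directions
            V[i, j, k] = np.sum(patch * filters[k]) + b[k]

print(V.shape)  # (4, 4, 4)
```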
Questions
- I really don't understand why only 4 neurones are used and not 12.
- Why did they pick K = 96 in example 1?
- Is the W parameter always equal to the width of the input image?