3 votes

Introduction

From what I understood from CS231n Convolutional Neural Networks for Visual Recognition, the size of the output volume gives the number of neurons, determined by the following parameters:

  1. Input volume size (W)
  2. Receptive field size of the Conv layer neurons (F), i.e. the size of the kernel or filter
  3. Stride with which they are applied (S), i.e. the step by which we move the kernel
  4. Amount of zero padding used (P) on the border
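
These four parameters determine the spatial output size via the CS231n formula (W - F + 2P)/S + 1. A minimal sketch, using the values from the two examples below:

```python
def conv_output_size(W, F, S, P):
    # CS231n formula for the spatial size of the output volume:
    # (W - F + 2P) / S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(227, 11, 4, 0))  # example 1 -> 55
print(conv_output_size(11, 5, 2, 0))    # example 2 -> 4
```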

I posted two examples. Example 1 poses no problem at all, but example 2 is where I get confused.


Example 1

In the Real-world example section they start with a [227 x 227 x 3] input image. The parameters are the following: F = 11, S = 4, P = 0, W = 227.

We note that the convolution has a depth of K = 96. (Why?)

The size of the output volume is (227 - 11)/4 + 1 = 55. So we will have 55 x 55 x 96 = 290,400 neurons, each connected (excuse me if I butchered the term) to an [11 x 11 x 3] region of the image, over which we compute the dot product with the kernel.
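
A quick check of that arithmetic (the 290,400 figure counts one neuron per (x, y, filter) position):

```python
out = (227 - 11) // 4 + 1          # 55 positions along each spatial axis
K = 96                             # number of filters
neurons = out * out * K
weights_per_neuron = 11 * 11 * 3   # each neuron sees an [11 x 11 x 3] region
print(neurons)                     # -> 290400
print(weights_per_neuron)          # -> 363
```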


Example 2

In the following example, taken from the Numpy examples section, we have an input image of shape [11 x 11 x 3]. The parameters used to compute the size of the output volume are: W = 11, P = 0, S = 2 and F = 5.

We note that the convolution has a depth of K = 4.

The formula (11-5)/2+1 = 4 produces only 4 neurons. Each neuron points to a region of size [5 x 5 x 3] in the image.

It seems that they are moving the kernel in the x direction only. Shouldn't we have 12 neurons, each having [5 x 5 x 3] weights?

V[0,0,0] = np.sum(X[:5,:5,:]   * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:]  * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:]  * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
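
The four lines above (quoted from CS231n) only walk the filter down the first spatial axis while the second index stays at 0; the remaining columns of the output are computed the same way. A sketch of the full loop for one depth slice, assuming random data for X and W0:

```python
import numpy as np

X = np.random.randn(11, 11, 3)   # input volume
W0 = np.random.randn(5, 5, 3)    # one filter
b0 = 0.1

S, F = 2, 5
out = (11 - F) // S + 1          # (11 - 5)/2 + 1 = 4
V = np.zeros((out, out))         # one depth slice of the output volume

for i in range(out):             # position along the first spatial axis
    for j in range(out):         # position along the second spatial axis
        V[i, j] = np.sum(X[i*S:i*S+F, j*S:j*S+F, :] * W0) + b0

print(V.shape)  # -> (4, 4): 16 neurons per depth slice, not 4
```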

Questions

  • I really don't understand why only 4 neurons are used and not 12.
  • Why did they pick K = 96 in example 1?
  • Is the W parameter always equal to the width of the input image?

1 Answer

2 votes

Example 1

Why does the convolution have a depth of K = 96?

The depth (K) equals the number of filters used in the convolutional layer. A bigger number usually gives better results; the trade-off is slower training. Complex images usually require more filters. I usually start testing with 32 filters on the first layer and 64 on the second layer.

Example 2

The formula (11-5)/2+1 = 4 produces only 4 neurons.

I'm no expert, but I think this is false. The formula only defines the output size (height and width). A convolutional layer has a size (height and width) and a depth. The size is defined by this formula; the depth by the number of filters used. The total number of neurons is:

height * width * depth = 4 * 4 * 4 = 64
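
That count can be checked directly from the question's parameters (W = 11, F = 5, S = 2, P = 0, K = 4):

```python
out = (11 - 5) // 2 + 1   # width and height of the output: 4 each
K = 4                     # depth = number of filters
print(out * out * K)      # -> 64 neurons in total
```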

Questions

  1. The layer has 64 neurons, 16 in each depth slice.
  2. More filters is usually better.
  3. As far as I know, you calculate the height and width of the output separately. When calculating the output width, W is the width of the image and F is the width of the filter; when calculating the height, you use the height of the image and filter. When the image and filter are square, a single calculation suffices because both give the same result.
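
To illustrate point 3, a sketch with a hypothetical non-square input (20 wide, 14 high) and a non-square 5x3 filter, stride 1, no padding; the two axes are computed independently:

```python
def conv_out(size, F, S, P):
    # Output size along one axis: (size - F + 2P) / S + 1
    return (size - F + 2 * P) // S + 1

W_img, H_img = 20, 14                # hypothetical non-square image
F_w, F_h = 5, 3                      # hypothetical non-square filter
out_w = conv_out(W_img, F_w, 1, 0)   # width of the output: 16
out_h = conv_out(H_img, F_h, 1, 0)   # height of the output: 12
print(out_w, out_h)                  # -> 16 12
```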