Trouble understanding Convolutional Neural Network

Question

I read about convolutional neural networks from here. Then I started playing with torch7. I am having confusion with the convolutional layer of a CNN.

From the tutorial,

1

The neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner.

2

For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights.

3

if the input layer is [32x32x3], CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the region they are connected to in the input volume. This may result in volume such as [32x32x12].

I started playing with what a CONV layer might do to an image. I did that in torch7. Here is my implementation,

require 'image'
require 'nn'

i = image.lena()

model = nn.Sequential()
model:add(nn.SpatialConvolutionMM(3, 10, 5, 5)) --depth = 3, #output layer = 10, filter = 5x5

res = model:forward(i)
itorch.image(res)
print(#i)
print(#res)

output

  3
 512
 512
[torch.LongStorage of size 3]

  10
 508
 508
[torch.LongStorage of size 3]

Now lets see the structure of a CNN

So, my questions are,

Question 1

Is the convolution done like this - lets say we take an image 32x32x3. And there is 5x5 filter. Then the 5x5 filter will pass through the whole 32x32 image and produce the convoluted images? Okay, so sliding 5x5 filter across the whole image, we get one image, if there are 10 output layers, we get 10 images(as you see from the output). How do we get these? (see the image for clarification if required)

Question 2

What is the number of neurons in the conv layer? Is it the number of output layers? In the code I've written above, model:add(nn.SpatialConvolutionMM(3, 10, 5, 5)). Is it 10? (no. of output layers?)

If so the point number 2 does not make any sense. According to that If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights. So what will be the weight here? I am greatly confused in this. In the model defined in torch, there is no weight. So how the weight is playing a role here?

Can someone explain what is going on?

aleju aleju · Accepted Answer · 2016-01-16T19:55:28

Is the convolution done like this - lets say we take an image 32x32x3. And there is 5x5 filter. Then the 5x5 filter will pass through the whole 32x32 image and produce the convoluted images?

For a 32x32x3 input image a 5x5 filter would iterate over every single pixel and for each pixel look at the 5x5 neighborhood. That neighborhood contains 5*5*3=75 values. Below is an example image for a 3x3 filter over a single input channel, i.e. one with a neighborhood of 3*3*1 values (source).

For each individual neighbor the filter will have one parameter (aka weight), so 75 parameters. Then to calculate one single output value (value at pixel x, y) it reads those neighbor values, multiplies each one with the respective parameter/weight and adds those up at the end (see discrete convolution). The optimal weights have to be learned during the training.

So one filter will iterate over the image and generate a new output, pixel by pixel. If you have multiple filters (i.e. the second parameter in SpatialConvolutionMM is >1) you get multiple outputs ("planes" in torch).

Okay, so sliding 5x5 filter across the whole image, we get one image, if there are 10 output layers, we get 10 images(as you see from the output). How do we get these? (see the image for clarification if required)

Each output plane gets generated by its own filter. Each filter has its own parameters (5*5*3 parameters in your example). The process for multiple filters is exactly the same as it is for one.

What is the number of neurons in the conv layer? Is it the number of output layers? In the code I've written above, model:add(nn.SpatialConvolutionMM(3, 10, 5, 5)). Is it 10? (no. of output layers?)

You should call them weights or parameters, "neurons" doesn't really fit for convolutional layers. The number of parameters is, as described, 5*5*3=75 per filter in your example. As you have 10 filters ("output planes") you have 750 parameters total. If you add a second layer to your network with model:add(nn.SpatialConvolutionMM(10, 10, 5, 5)) you would have an additional 5*5*10=250 parameters per filter and 250*10=2500 total. Notice how that number can quickly grow (512 filters/output planes in one layer operating on 256 input planes is nothing uncommon).

For further reading you should look at http://neuralnetworksanddeeplearning.com/chap6.html . Scroll down to the chapter "Introducing convolutional networks". Under "Local receptive fields" there are visualizations which will probably help you understand what a filter does (one was shown above).

Trouble understanding Convolutional Neural Network

1

2

3

Question 1

Question 2

2 Answers