Deep learning with PyTorch: understanding the neural network example

Question

I'm reading the PyTorch documentation and I have a couple of questions about the neural network that it introduces. The documentation defines the following network:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):

        def __init__(self):
            super(Net, self).__init__()
            # 1 input image channel, 6 output channels, 3x3 square convolution
            # kernel
            self.conv1 = nn.Conv2d(1, 6, 3)
            self.conv2 = nn.Conv2d(6, 16, 3)
            # an affine operation: y = Wx + b
            self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            # Max pooling over a (2, 2) window
            x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
            # If the size is a square you can only specify a single number
            x = F.max_pool2d(F.relu(self.conv2(x)), 2)
            x = x.view(-1, self.num_flat_features(x))
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x

        def num_flat_features(self, x):
            size = x.size()[1:]  # all dimensions except the batch dimension
            num_features = 1
            for s in size:
                num_features *= s
            return num_features

Later on, the following statement is made:

Let's try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.

Question 1: Why do the images need to be 32x32 (where I assume that this means 32 pixels by 32)?

The first convolution applies six kernels to an image, with every kernel being 3x3. This means that if the input channel is 32x32, the six output channels all have dimensions 30x30 (the 3x3 kernel grid makes you lose 2 pixels in width and height). The second convolution applies more kernels, so that there are now sixteen output channels of dimensions 28x28 (again the 3x3 kernel grid makes you lose 2 pixels in width and height). I would therefore expect 16x28x28 nodes in the next layer, since every one of the sixteen output channels has 28x28 pixels. Somehow this is incorrect, and the next layer contains 16x6x6 nodes. Why is this true?

Question 2: The second convolution layer goes from six input channels to sixteen output channels. How is this done?

In the first convolution layer we go from one input channel to six output channels, which makes sense to me. You can just apply six kernels to the single input channel to arrive at six output channels. Going from six input channels to sixteen output channels does not make as much sense to me. How are the different kernels applied? Do you apply two kernels to each of the first five input channels to arrive at ten output channels, and apply six kernels to the last input channel, so that the total comes to sixteen output channels? Or does the neural network learn by itself to use x kernels and apply them to the input channels that it finds most suitable?

Mr. President · Accepted Answer · 2019-09-07T11:52:34

I can now answer these questions myself.

Question 1: To see why you need a 32x32 image for this neural network to work, consider the following:

Layer 1: First, convolution is applied with a 3x3 kernel. Since the image has dimensions 32x32, this results in a 30x30 grid. Next, max pooling is applied to the grid with a 2x2 kernel and a stride of 2, resulting in a 15x15 grid.

Layer 2: First, convolution is applied with a 3x3 kernel to the 15x15 grid, resulting in a 13x13 grid. Next, max pooling is applied with a 2x2 kernel and a stride of 2, resulting in a 6x6 grid. We get a 6x6 grid and not a 7x7 grid because max pooling rounds the output size down by default (floor rather than ceil, i.e. ceil_mode=False).

Since the convolution in layer 2 has sixteen output channels, the flattened input to the first linear layer has 16x6x6 = 576 features, which matches nn.Linear(16 * 6 * 6, 120). A 32x32 image therefore produces exactly the shape the network expects.
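
The arithmetic above can be checked with a small sketch in plain Python. It assumes the usual output-size formula floor((size + 2*padding - kernel) / stride) + 1 with no padding and no dilation; the helper names conv_out and trace are made up for this illustration and are not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling window (rounds down)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(size):
    """Follow one side length through conv1 -> pool -> conv2 -> pool."""
    sizes = [size]
    # (kernel, stride) for each step: 3x3 conv, 2x2 pool, 3x3 conv, 2x2 pool
    for kernel, stride in [(3, 1), (2, 2), (3, 1), (2, 2)]:
        size = conv_out(size, kernel, stride)
        sizes.append(size)
    return sizes


print(trace(32))                 # [32, 30, 15, 13, 6]
print(16 * trace(32)[-1] ** 2)   # 576, the in_features of fc1
```

Calling trace with other side lengths shows how the final grid size, and therefore the size fc1 would need, changes with the input.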

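Question 2: Every output channel is created by applying six different 3x3 kernels, one to each of the six input channels, and summing the results (plus a bias term). With sixteen output channels, the layer therefore holds 16x6 = 96 kernels in total. This is explained in the documentation of nn.Conv2d.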
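
One way to convince yourself of this is to rebuild a single output channel by hand. The following is a minimal sketch, assuming a default nn.Conv2d (groups=1, no padding) and a random 15x15 input with six channels; the variable names are chosen for this illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

conv2 = nn.Conv2d(6, 16, 3)
# One 3x3 kernel per (output channel, input channel) pair
print(conv2.weight.shape)  # torch.Size([16, 6, 3, 3])

x = torch.randn(1, 6, 15, 15)   # batch of one, six input channels
out = conv2(x)                  # shape (1, 16, 13, 13)

# Rebuild output channel 0: convolve each input channel with its own
# kernel, sum the six results, then add the bias of output channel 0.
manual = sum(
    F.conv2d(x[:, c:c + 1], conv2.weight[0:1, c:c + 1])
    for c in range(6)
) + conv2.bias[0]

# Compare the hand-built channel with the layer's own output
print(torch.allclose(out[:, 0:1], manual, atol=1e-5))
```

The same loop works for any of the sixteen output channels by replacing the index 0 in the weight and bias.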
