I'm reading the PyTorch documentation and I have a couple of questions about the neural network it introduces. The documentation defines the following network:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
Later on, the following statement is made:
Let's try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.
Question 1: Why do the images need to be 32x32 (where I assume that this means 32 pixels by 32)?
The first convolution applies six kernels to an image, with every kernel being 3x3. This means that if the input channel is 32x32, the six output channels all have dimensions 30x30 (the 3x3 kernel makes you lose 2 pixels in width and height). The second convolution applies more kernels, so that there are now sixteen output channels of dimensions 28x28 (again the 3x3 kernel makes you lose 2 pixels in width and height). Now I would expect 16x28x28 nodes in the next layer, since every one of the sixteen output channels has 28x28 pixels. Somehow, this is incorrect, and the next layer contains 16x6x6 nodes. Why is this true?
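One way to check this arithmetic is to trace the spatial size through every step of forward, including the two F.max_pool2d calls that sit between the convolutions. The following is a minimal sketch in plain Python, assuming stride-1 convolutions without padding and pooling whose stride equals its window (the defaults used by the network above). The helper names conv_out, pool_out and trace_sizes are hypothetical and not part of PyTorch.

```python
def conv_out(size, kernel):
    """Output width/height of a convolution with stride 1 and no padding."""
    return size - kernel + 1


def pool_out(size, window):
    """Output width/height of max pooling with stride == window (floor mode)."""
    return size // window


def trace_sizes(size):
    """Follow the spatial size through conv1, pool, conv2, pool of the Net above."""
    steps = []
    size = conv_out(size, 3)   # self.conv1 = nn.Conv2d(1, 6, 3)
    steps.append(("conv1", 6, size))
    size = pool_out(size, 2)   # F.max_pool2d(..., (2, 2))
    steps.append(("pool1", 6, size))
    size = conv_out(size, 3)   # self.conv2 = nn.Conv2d(6, 16, 3)
    steps.append(("conv2", 16, size))
    size = pool_out(size, 2)   # F.max_pool2d(..., 2)
    steps.append(("pool2", 16, size))
    return steps


if __name__ == "__main__":
    for name, channels, size in trace_sizes(32):
        print(f"{name}: {channels} x {size} x {size}")
```

Under these assumptions a 32x32 input ends at 16 channels of 6x6, which matches the 16 * 6 * 6 input features of fc1; the pooling steps are what my 28x28 estimate leaves out.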
Question 2: The second convolution layer goes from six input channels to sixteen output channels. How is this done?
In the first convolution layer we go from one input channel to six output channels, which makes sense to me. You can just apply six kernels to the single input channel to arrive at six output channels. Going from six input channels to sixteen output channels does not make as much sense to me. How are the different kernels applied? Do you apply two kernels to each of the first five input channels to arrive at ten output channels, and apply six kernels to the last input channel, so that the total comes to sixteen output channels? Or does the neural network learn by itself to use x kernels and apply them to the input channels that it finds most suitable?
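A way to probe this is to inspect the weight tensor of the second layer and recompute one output channel by hand. The sketch below assumes the standard nn.Conv2d behaviour with default arguments (groups=1), where every output channel has its own 3x3 slice for each of the six input channels and the six results are summed together with a bias. The 15x15 input size and the variable names are only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

conv2 = nn.Conv2d(6, 16, 3)

# Weight layout: (out_channels, in_channels, kernel_height, kernel_width)
weight_shape = tuple(conv2.weight.shape)
bias_shape = tuple(conv2.bias.shape)
print(weight_shape, bias_shape)

with torch.no_grad():
    x = torch.randn(1, 6, 15, 15)   # one sample with six input channels
    out = conv2(x)                  # sixteen output channels

    # Recompute output channel 0 by hand: convolve each input channel with
    # its own 3x3 slice of that output channel's kernel, sum, then add bias.
    manual = sum(
        F.conv2d(x[:, c:c + 1], conv2.weight[0:1, c:c + 1])
        for c in range(6)
    ) + conv2.bias[0]

    matches = torch.allclose(manual, out[:, 0:1], atol=1e-4)

print(tuple(out.shape), matches)
```

If the hand computation matches, each of the sixteen output channels looks at all six input channels through a 6x3x3 kernel, rather than kernels being assigned to individual input channels.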