1 vote

According to https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py

I don't understand why VGG models use 512 * 7 * 7 as the input size of the first fully-connected layer. The last convolutional block is

  • nn.Conv2d(512, 512, kernel_size=3, padding=1),
  • nn.ReLU(True),
  • nn.MaxPool2d(kernel_size=2, stride=2, dilation=1)

The code from the above link:

import torch.nn as nn

class VGG(nn.Module):

    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

2 Answers

3 votes

To understand this, you have to know how the convolution operator works in CNNs. nn.Conv2d(512, 512, kernel_size=3, padding=1) means that the input to that convolution has 512 channels and that the output after the convolution will also have 512 channels. The input is convolved with a 3x3 kernel that moves over it as a sliding window. Finally, padding=1 means that, before applying the convolution, we symmetrically pad the edges of the input with zeros.
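
For instance, here is a minimal check with a random tensor (shapes chosen purely for illustration) showing that this convolution preserves the spatial size:

import torch
import torch.nn as nn

# A 3x3 convolution with padding=1 leaves the spatial size unchanged;
# only the number of channels can change between input and output.
conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)
x = torch.randn(1, 512, 7, 7)   # (batch, channels, height, width)
print(conv(x).shape)            # torch.Size([1, 512, 7, 7])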

In the example you mention, you can think of 512 as the depth, while 7x7 is the width and height obtained after applying several convolution and pooling layers. Imagine that we have an image with some width and height and we feed it to a convolution; the resulting size will be

owidth  = floor(((width  + 2*padW - kW) / dW) + 1) 
oheight = floor(((height + 2*padH - kH) / dH) + 1)

where width and height are the original sizes, padW and padH are the horizontal and vertical padding, kW and kH are the kernel sizes, and dW and dH are the horizontal and vertical strides, i.e. the number of pixels the kernel moves at each step (with dW=1 the kernel starts at pixel (0,0) and then moves to (1,0), and so on).
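
To make the formula concrete, here is a small sketch that wraps it in a helper (the name conv_output_size is just for illustration):

import math

def conv_output_size(size, kernel, pad, stride):
    # Direct translation of the owidth/oheight formula above.
    return math.floor((size + 2 * pad - kernel) / stride) + 1

print(conv_output_size(224, kernel=3, pad=1, stride=1))   # 224 (3x3 conv, padding 1)
print(conv_output_size(224, kernel=2, pad=0, stride=2))   # 112 (2x2 max pool, stride 2)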

Usually the first convolution operator in a CNN looks like nn.Conv2d(3, D, kernel_size=3, padding=1), because the original image has 3 input channels (RGB). Assuming the input image has a size of 256x256x3 pixels, if we apply the operator as defined above, the resulting feature map has the same width and height as the input, but its depth is now D.

Similarly, if we define the convolution as c = nn.Conv2d(3, 15, kernel_size=25, padding=0, stride=5), with kernel_size=25, no padding and stride=5 (dW=dH=5, meaning the kernel moves 5 pixels each time: from (0,0) it moves to (5,0), and so on until it reaches the end of the image on the x-axis, then it jumps to (0,5) -> (5,5) -> (10,5), and so on), the resulting output will have a size of 47x47x15.
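
Here is a short sketch verifying both cases on random input (D=64 in the first case is an arbitrary choice made only for this example):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)              # a random 256x256 RGB "image"

# Same-size case: 3x3 kernel, padding 1, stride 1 (D = 64 chosen arbitrarily)
c1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(c1(x).shape)                            # torch.Size([1, 64, 256, 256])

# Large kernel, no padding, stride 5: floor((256 - 25) / 5) + 1 = 47
c2 = nn.Conv2d(3, 15, kernel_size=25, padding=0, stride=5)
print(c2(x).shape)                            # torch.Size([1, 15, 47, 47])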

2 votes

The VGG network has two sections of layers: the "features" section and the "classifier" section. The input to the features section is always an image of size 224 x 224 pixels.

The features section contains 5 nn.MaxPool2d(kernel_size=2, stride=2) pooling layers. See line 76 of the referenced source code: each 'M' character in the configurations sets up one MaxPool2d layer.

A MaxPool2d layer with these parameters halves the spatial size of its input. So we get 224 --> 112 --> 56 --> 28 --> 14 --> 7, which means the output of the features section is a tensor of 512 channels * 7 * 7. This is the input to the "classifier" section.
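
A quick sanity check, assuming torchvision is installed (no pretrained weights are needed for a shape check):

import torch
import torchvision

# Five halvings: 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(224 // 2 ** 5)                 # 7

# The features section of VGG16 turns a 224x224 RGB image into 512 x 7 x 7.
model = torchvision.models.vgg16()
x = torch.randn(1, 3, 224, 224)
print(model.features(x).shape)       # torch.Size([1, 512, 7, 7])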