1 vote

According to https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py

I don't understand why VGG models use 512 * 7 * 7 as the input size of the first fully-connected layer. The last convolutional block is

  • nn.Conv2d(512, 512, kernel_size=3, padding=1),
  • nn.ReLU(True),
  • nn.MaxPool2d(kernel_size=2, stride=2, dilation=1)

The code from the above link:

import torch.nn as nn

class VGG(nn.Module):

    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

2 Answers

3 votes

To understand this, you have to know how the convolution operator works in CNNs. nn.Conv2d(512, 512, kernel_size=3, padding=1) means that the input to that convolution has 512 channels and that the output after the convolution will also have 512 channels. The input is convolved with a 3x3 kernel that moves over it as a sliding window. Finally, padding=1 means that, before applying the convolution, we symmetrically pad the edges of the input with zeros.
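
For instance, here is a minimal check with a random tensor (shapes chosen purely for illustration) showing that this convolution preserves the spatial size:

import torch
import torch.nn as nn

# A 3x3 convolution with padding=1 leaves the spatial size unchanged;
# only the number of channels can change between input and output.
conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)
x = torch.randn(1, 512, 7, 7)   # (batch, channels, height, width)
print(conv(x).shape)            # torch.Size([1, 512, 7, 7])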

In the example you mention, you can think of 512 as the depth, while 7x7 is the width and height obtained after applying several convolution and pooling layers. Imagine that we have an image with some width and height and we feed it to a convolution; the resulting size will be

owidth  = floor(((width  + 2*padW - kW) / dW) + 1) 
oheight = floor(((height + 2*padH - kH) / dH) + 1)

where width and height are the original sizes, padW and padH are the horizontal and vertical padding, kW and kH are the kernel sizes, and dW and dH are the horizontal and vertical strides, i.e. the number of pixels the kernel moves at each step (with dW=1 the kernel starts at pixel (0,0) and then moves to (1,0), and so on).
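
To make the formula concrete, here is a small sketch that wraps it in a helper (the name conv_output_size is just for illustration):

import math

def conv_output_size(size, kernel, pad, stride):
    # Direct translation of the owidth/oheight formula above.
    return math.floor((size + 2 * pad - kernel) / stride) + 1

print(conv_output_size(224, kernel=3, pad=1, stride=1))   # 224 (3x3 conv, padding 1)
print(conv_output_size(224, kernel=2, pad=0, stride=2))   # 112 (2x2 max pool, stride 2)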

Usually the first convolution operator in a CNN looks like nn.Conv2d(3, D, kernel_size=3, padding=1), because the original image has 3 input channels (RGB). Assuming the input image has a size of 256x256x3 pixels, if we apply the operator as defined above, the resulting feature map has the same width and height as the input, but its depth is now D.

Similarly, if we define the convolution as c = nn.Conv2d(3, 15, kernel_size=25, padding=0, stride=5), with kernel_size=25, no padding and stride=5 (dW=dH=5, meaning the kernel moves 5 pixels each time: from (0,0) it moves to (5,0), and so on until it reaches the end of the image on the x-axis, then it jumps to (0,5) -> (5,5) -> (10,5), and so on), the resulting output will have a size of 47x47x15.
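
Here is a short sketch verifying both cases on random input (D=64 in the first case is an arbitrary choice made only for this example):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)              # a random 256x256 RGB "image"

# Same-size case: 3x3 kernel, padding 1, stride 1 (D = 64 chosen arbitrarily)
c1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(c1(x).shape)                            # torch.Size([1, 64, 256, 256])

# Large kernel, no padding, stride 5: floor((256 - 25) / 5) + 1 = 47
c2 = nn.Conv2d(3, 15, kernel_size=25, padding=0, stride=5)
print(c2(x).shape)                            # torch.Size([1, 15, 47, 47])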

2 votes

The VGG network has two sections of layers: the "features" section and the "classifier" section. The input to the features section is always an image of size 224 x 224 pixels.

The features section contains 5 nn.MaxPool2d(kernel_size=2, stride=2) pooling layers. See line 76 of the referenced source code: each 'M' character in the configurations sets up one MaxPool2d layer.

A MaxPool2d layer with these parameters halves the spatial size of its input. So we get 224 --> 112 --> 56 --> 28 --> 14 --> 7, which means the output of the features section is a tensor of 512 channels * 7 * 7. This is the input to the "classifier" section.
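
A quick sanity check, assuming torchvision is installed (no pretrained weights are needed for a shape check):

import torch
import torchvision

# Five halvings: 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(224 // 2 ** 5)                 # 7

# The features section of VGG16 turns a 224x224 RGB image into 512 x 7 x 7.
model = torchvision.models.vgg16()
x = torch.randn(1, 3, 224, 224)
print(model.features(x).shape)       # torch.Size([1, 512, 7, 7])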