
In the book Deep Learning with Python by François Chollet (creator of Keras), section 5.3 (see the companion Jupyter notebook), the following is unclear to me:

Let's put this in practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from our cat and dog images, and then training a cat vs. dog classifier on top of these features.

[...]

There are two ways we could proceed:

  • Running the convolutional base over our dataset, recording its output to a Numpy array on disk, then using this data as input to a standalone densely-connected classifier similar to those you have seen in the first chapters of this book. This solution is very fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. However, for the exact same reason, this technique would not allow us to leverage data augmentation at all.
  • Extending the model we have (conv_base) by adding Dense layers on top, and running the whole thing end-to-end on the input data. This allows us to use data augmentation, because every input image is going through the convolutional base every time it is seen by the model. However, for this same reason, this technique is far more expensive than the first one.

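For concreteness, the two options look roughly like this in Keras. This is only a sketch in the spirit of the book's listings; the directory path, image size, and sample counts below are placeholders I have filled in, not necessarily the notebook's exact values.

```python
import numpy as np
from keras import models, layers
from keras.applications import VGG16
from keras.preprocessing.image import ImageDataGenerator

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
train_dir = 'cats_and_dogs_small/train'   # placeholder path

# Option 1: run conv_base once per image, save the features, and train a small
# densely-connected classifier on top of them (fast, but no augmentation).
datagen = ImageDataGenerator(rescale=1./255)            # note: no augmentation here
generator = datagen.flow_from_directory(
    train_dir, target_size=(150, 150), batch_size=20, class_mode='binary')

features, labels = [], []
for i, (inputs_batch, labels_batch) in enumerate(generator):
    features.append(conv_base.predict(inputs_batch))    # the expensive step, done once per image
    labels.append(labels_batch)
    if (i + 1) * 20 >= 2000:                            # stop after one pass over 2000 images
        break
features = np.concatenate(features).reshape(-1, 4 * 4 * 512)
labels = np.concatenate(labels)

classifier = models.Sequential([
    layers.Dense(256, activation='relu', input_dim=4 * 4 * 512),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])
classifier.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
classifier.fit(features, labels, epochs=30, batch_size=20)

# Option 2: put Dense layers on top of the frozen conv_base and train end to end,
# so augmented images can be fed in, at the price of every image going through
# conv_base in every single epoch.
model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
conv_base.trainable = False
augmented = ImageDataGenerator(rescale=1./255, rotation_range=40, horizontal_flip=True)
# model.compile(...); model.fit_generator(augmented.flow_from_directory(...), ...)
```
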
Why can't we augment our data (generate more images from the existing data), run the convolutional base over the augmented dataset (one time), record its output and then use this data as input to a standalone fully-connected classifier?

Wouldn't it give similar results to the second alternative but be faster?

What am I missing?
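Concretely, the variant I have in mind would be something like the following sketch; K (the number of augmented copies per image), the directory path, and the sample count are placeholders for illustration, not values from the book.

```python
import numpy as np
from keras.applications import VGG16
from keras.preprocessing.image import ImageDataGenerator

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
train_dir = 'cats_and_dogs_small/train'    # placeholder path

K = 5                                      # augmented copies per original image
augmenter = ImageDataGenerator(rescale=1./255, rotation_range=40,
                               width_shift_range=0.2, horizontal_flip=True)
generator = augmenter.flow_from_directory(
    train_dir, target_size=(150, 150), batch_size=20, class_mode='binary')

features, labels = [], []
n_batches = K * 2000 // 20                 # K passes over a 2000-image training set
for i, (inputs_batch, labels_batch) in enumerate(generator):
    features.append(conv_base.predict(inputs_batch))   # K * 2000 conv_base passes in total
    labels.append(labels_batch)
    if i + 1 >= n_batches:
        break
features = np.concatenate(features).reshape(-1, 4 * 4 * 512)
labels = np.concatenate(labels)

np.save('augmented_features.npy', features)   # reuse these to train the dense classifier
np.save('augmented_labels.npy', labels)
```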


2 Answers

2 votes

Wouldn't it give similar results to the second alternative but be faster?

Similar results, yes, but would it really be faster?

Chollet's main point here is that the second way is more expensive simply because of the larger number of images pushed through the convolutional base by the augmentation procedure itself: while the first approach

only requires running the convolutional base once for every input image

in the second

every input image is going through the convolutional base every time it is seen by the model [...] for this same reason, this technique is far more expensive than the first one

since

the convolutional base is by far the most expensive part of the pipeline

where "every time it is seen by the model" must be understood as "in every version produced by the augmentation procedure" (agree, the wording could and should be clearer here...).

There is no workaround for this with your proposed method. It is a valid alternative version of the second way, sure, but there is no reason to believe it will actually be faster, taking into account the whole end-to-end process (CNN + FC) in both cases...

UPDATE (after comment):

Maybe you are right, but I still have a feeling of missing something since the author explicitly wrote that the first method "would not allow us to leverage data augmentation at all".

I think you are just reading too much into this, although, again, the author arguably could and should be clearer. As written, Chollet's argument here is somewhat circular (it can happen to the best of us): since we run "the convolutional base [only] once for every input image", it follows by definition that we don't use any augmentation... Interestingly enough, the phrasing in the book (p. 146) is slightly different (less dramatic):

But for the same reason, this technique won’t allow you to use data augmentation.

And what is that reason? That we feed each image to the convolutional base only once, of course...

In other words, it is not really that we are not "allowed" to augment, but rather that we have chosen not to (in order to be faster, that is)...

0 votes

Looking at the VGG16 paper and interpreting a bit, I believe the difference is basically in how many times your base network is going to see the input images, and how it will treat them as a result.

According to the paper, random scaling is performed on the input images during training (scale jittering). If you place your new dense layers on top of the frozen base network and then run the whole stack through a training procedure (the second approach), I suppose the assumption is that you would not disable the scale-jittering mechanism in the base network; thus you would see (different) randomly scaled versions of each input image on every pass through your training set (every epoch).

If you run the input images through your base network a single time (the first approach), the base is essentially running in evaluation mode, so it does not scale the input images at all or apply any other augmentation-style transformation. You could do this yourself, adding the augmented images to your newly transformed dataset, but I suppose the book is assuming that you won't.

Either way, you would likely end up training for multiple epochs (multiple passes through the dataset), so the second approach carries the added cost of running the whole base network for every training sample in every epoch, whereas the first approach only requires running the base network once per sample, offline, and then training on the pre-transformed samples.
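
To put rough numbers on that, here is a back-of-the-envelope count of forward passes through the convolutional base under each scheme; N, E and K are illustrative placeholders of my own, not figures from the book or the question.

```python
# N = training images, E = training epochs, K = augmented copies per image
# in the offline-augmentation variant proposed in the question.
N, E, K = 2000, 30, 5                # illustrative placeholders only

passes_feature_extraction = N        # first approach: one pass per original image, offline
passes_end_to_end = N * E            # second approach: every image re-enters the base each epoch
passes_offline_augmentation = N * K  # proposed variant: one pass per augmented image, offline

print(passes_feature_extraction, passes_end_to_end, passes_offline_augmentation)
```

Either way, the cost scales with how many (augmented) images actually have to pass through the base network.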