
In the book Deep Learning with Python by François Chollet (creator of Keras), section 5.3 (see the companion Jupyter notebook), the following is unclear to me:

Let's put this in practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from our cat and dog images, and then training a cat vs. dog classifier on top of these features.

[...]

There are two ways we could proceed:

  • Running the convolutional base over our dataset, recording its output to a Numpy array on disk, then using this data as input to a standalone densely-connected classifier similar to those you have seen in the first chapters of this book. This solution is very fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. However, for the exact same reason, this technique would not allow us to leverage data augmentation at all.
  • Extending the model we have (conv_base) by adding Dense layers on top, and running the whole thing end-to-end on the input data. This allows us to use data augmentation, because every input image is going through the convolutional base every time it is seen by the model. However, for this same reason, this technique is far more expensive than the first one.

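For concreteness, the two options look roughly like this in Keras. This is only a sketch in the spirit of the book's listings; the directory path, image size, and sample counts below are placeholders I have filled in, not necessarily the notebook's exact values.

```python
import numpy as np
from keras import models, layers
from keras.applications import VGG16
from keras.preprocessing.image import ImageDataGenerator

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
train_dir = 'cats_and_dogs_small/train'   # placeholder path

# Option 1: run conv_base once per image, save the features, and train a small
# densely-connected classifier on top of them (fast, but no augmentation).
datagen = ImageDataGenerator(rescale=1./255)            # note: no augmentation here
generator = datagen.flow_from_directory(
    train_dir, target_size=(150, 150), batch_size=20, class_mode='binary')

features, labels = [], []
for i, (inputs_batch, labels_batch) in enumerate(generator):
    features.append(conv_base.predict(inputs_batch))    # the expensive step, done once per image
    labels.append(labels_batch)
    if (i + 1) * 20 >= 2000:                            # stop after one pass over 2000 images
        break
features = np.concatenate(features).reshape(-1, 4 * 4 * 512)
labels = np.concatenate(labels)

classifier = models.Sequential([
    layers.Dense(256, activation='relu', input_dim=4 * 4 * 512),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])
classifier.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
classifier.fit(features, labels, epochs=30, batch_size=20)

# Option 2: put Dense layers on top of the frozen conv_base and train end to end,
# so augmented images can be fed in, at the price of every image going through
# conv_base in every single epoch.
model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
conv_base.trainable = False
augmented = ImageDataGenerator(rescale=1./255, rotation_range=40, horizontal_flip=True)
# model.compile(...); model.fit_generator(augmented.flow_from_directory(...), ...)
```
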
Why can't we augment our data (generate more images from the existing data), run the convolutional base over the augmented dataset (one time), record its output and then use this data as input to a standalone fully-connected classifier?

Wouldn't it give similar results to the second alternative but be faster?

What am I missing?
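Concretely, the variant I have in mind would be something like the following sketch; K (the number of augmented copies per image), the directory path, and the sample count are placeholders for illustration, not values from the book.

```python
import numpy as np
from keras.applications import VGG16
from keras.preprocessing.image import ImageDataGenerator

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
train_dir = 'cats_and_dogs_small/train'    # placeholder path

K = 5                                      # augmented copies per original image
augmenter = ImageDataGenerator(rescale=1./255, rotation_range=40,
                               width_shift_range=0.2, horizontal_flip=True)
generator = augmenter.flow_from_directory(
    train_dir, target_size=(150, 150), batch_size=20, class_mode='binary')

features, labels = [], []
n_batches = K * 2000 // 20                 # K passes over a 2000-image training set
for i, (inputs_batch, labels_batch) in enumerate(generator):
    features.append(conv_base.predict(inputs_batch))   # K * 2000 conv_base passes in total
    labels.append(labels_batch)
    if i + 1 >= n_batches:
        break
features = np.concatenate(features).reshape(-1, 4 * 4 * 512)
labels = np.concatenate(labels)

np.save('augmented_features.npy', features)   # reuse these to train the dense classifier
np.save('augmented_labels.npy', labels)
```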


2 Answers

2 votes

Wouldn't it give similar results to the second alternative but be faster?

Similar results, yes, but would it really be faster?

Chollet's main point here is that the second way is more expensive simply because of the larger number of images pushed through the convolutional base by the augmentation procedure itself: while the first approach

only requires running the convolutional base once for every input image

in the second

every input image is going through the convolutional base every time it is seen by the model [...] for this same reason, this technique is far more expensive than the first one

since

the convolutional base is by far the most expensive part of the pipeline

where "every time it is seen by the model" must be understood as "in every version produced by the augmentation procedure" (agree, the wording could and should be clearer here...).

There is no workaround for this with your proposed method. It is a valid alternative version of the second way, sure, but there is no reason to believe it will actually be faster, taking into account the whole end-to-end process (CNN + FC) in both cases...

UPDATE (after comment):

Maybe you are right, but I still have a feeling of missing something since the author explicitly wrote that the first method "would not allow us to leverage data augmentation at all".

I think you are just reading too much into this, although, again, the author arguably could and should be clearer. As written, Chollet's argument here is somewhat circular (it can happen to the best of us): since we run "the convolutional base [only] once for every input image", it follows by definition that we don't use any augmentation... Interestingly enough, the phrasing in the book (p. 146) is slightly different (less dramatic):

But for the same reason, this technique won’t allow you to use data augmentation.

And what is that reason? That we feed each image to the convolutional base only once, of course...

In other words, it is not really that we are not "allowed" to augment, but rather that we have chosen not to (in order to be faster, that is)...

0 votes

Looking at the VGG16 paper and interpreting a bit, I believe the difference is basically in how many times your base network is going to see the input images, and how it will treat them as a result.

According to the paper, random scaling is performed on the input images during training (scale jittering). If you place your new dense layers on top of the frozen base network and then run the whole stack through a training procedure (the second approach), I suppose the assumption is that you would not disable the scale-jittering mechanism in the base network; thus you would see (different) randomly scaled versions of each input image on every pass through your training set (every epoch).

If you run the input images through your base network a single time (the first approach), the base is essentially running in evaluation mode, so it does not scale the input images at all or apply any other augmentation-style transformation. You could do this yourself, adding the augmented images to your newly transformed dataset, but I suppose the book is assuming that you won't.

Either way, you would likely end up training for multiple epochs (multiple passes through the dataset), so the second approach carries the added cost of running the whole base network for every training sample in every epoch, whereas the first approach only requires running the base network once per sample, offline, and then training on the pre-transformed samples.
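
To put rough numbers on that, here is a back-of-the-envelope count of forward passes through the convolutional base under each scheme; N, E and K are illustrative placeholders of my own, not figures from the book or the question.

```python
# N = training images, E = training epochs, K = augmented copies per image
# in the offline-augmentation variant proposed in the question.
N, E, K = 2000, 30, 5                # illustrative placeholders only

passes_feature_extraction = N        # first approach: one pass per original image, offline
passes_end_to_end = N * E            # second approach: every image re-enters the base each epoch
passes_offline_augmentation = N * K  # proposed variant: one pass per augmented image, offline

print(passes_feature_extraction, passes_end_to_end, passes_offline_augmentation)
```

Either way, the cost scales with how many (augmented) images actually have to pass through the base network.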