
I want to test the performance of each convolutional layer of my Convolutional Neural Network (CNN) architecture using an SVM. I am using the MatConvNet MATLAB toolbox.

The layers are as follows: Conv1 Relu1 Pool1 (3x3, 32 features) -> Conv2 Relu2 Pool2 (3x3, 64 features) -> Conv3 Relu3 Pool3 (3x3, 128 features) -> Conv4 Relu4 (1x1, 256 features) -> Conv5 (1x1, 4 features) -> Softmaxloss

After training, I removed the loss layer:

net.layers=net.layers(1 : end - 1);

The network now looks like this: [screenshot of the network layers]

I can extract the features like this:

feats = vl_simplenn(net, im);
Feature_L1(fea, :) = squeeze(feats(end).x);  % fea indexes the current image

Similarly, I removed two more layers and extracted 256 features from Conv4. But when I moved to Conv3, the output feature map is 7x7x128. I want to know how I can use these features:
i) by flattening them into a single vector, or ii) by averaging the values along the depth dimension?

Make a single vector with 7x7x128 = 6272 dimensions. If you stumble into another layer and the final vector would be too big, you might want to use some dimensionality reduction, such as PCA. - rafaspadilha
Does it matter whether each 7x7 map is turned into a single vector column-wise or row-wise? - Addee
By "if you stumble into another layer", do you mean combining the features of two layers? - Addee
I see. You may want to increase the number of layers of your network as a way of increasing its learning capacity. Your network is fairly simple compared to what is being used nowadays. - rafaspadilha
Could you put these things in an answer so that I can accept it? - Addee

1 Answer


Transforming a map into a feature vector:

In your case, you could turn the 7x7x128 map into an array with 6272 dimensions. In case you have an intermediate map whose flattened array has a high dimensionality (e.g., 100k-d), you may want to reduce its dimensions (through PCA, for example), because it probably contains a lot of redundancy in its features.
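As a sketch, both options from the question look like this in MATLAB (`featMap` is a hypothetical variable holding the 7x7x128 output of Conv3):

```matlab
% featMap: 7x7x128 output of Conv3 (hypothetical variable name)
featMap = feats(end).x;

% Option i) flatten into a single 1x6272 row vector;
% reshape linearizes in MATLAB's column-major order
flatVec = reshape(featMap, 1, []);

% Option ii) average over the 7x7 spatial grid, keeping one
% value per channel -> a 128x1 vector
avgVec = squeeze(mean(mean(featMap, 1), 2));
```

Regarding the column-wise vs. row-wise comment: as long as every training and test image is flattened the same way, the traversal order does not change what the SVM can learn.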

Combining the output of your layers:

As for combining the features, there are many ways of doing it. An option that you've mentioned is to create a single vector concatenating the desired layers' outputs and train a classifier on top of that. This is known as early fusion [1], where you combine your features for a single representation before training.
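A minimal early-fusion sketch, assuming `feat3` and `feat4` are hypothetical flattened outputs of Conv3 and Conv4 for one image, and that VLFeat's `vl_svmtrain` is on your path:

```matlab
% Early fusion: concatenate the per-layer feature vectors into one
% representation before training
fused = [feat3(:); feat4(:)];   % (6272 + 256) x 1 combined vector

% Train a single SVM on the fused vectors of all training images;
% vl_svmtrain expects one column per sample and labels in {-1, +1}
[w, b] = vl_svmtrain(trainFeats, trainLabels, 0.01);
```

Here `trainFeats` would be the matrix whose columns are the fused vectors of all training images; the 0.01 regularization value is just a placeholder to tune by cross-validation.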

Another possibility is to train a separate classifier for each feature (the output of each intermediate layer, in your case) and then, for a testing image, you combine the outputs/scores of those separate classifiers. This is known as late fusion [1].
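A late-fusion sketch for a binary case, assuming `w3`/`b3` and `w4`/`b4` are hypothetical SVMs trained separately on Conv3 and Conv4 features:

```matlab
% Late fusion: one SVM per layer, combine their scores at test time
score3 = w3' * feat3(:) + b3;   % decision value from the Conv3 SVM
score4 = w4' * feat4(:) + b4;   % decision value from the Conv4 SVM

% A simple combination rule: average the decision values
% (alternatives: max, weighted sum, majority vote over classifiers)
finalScore = (score3 + score4) / 2;
prediction = sign(finalScore);
```

For your 4-class problem you would do this per class (e.g., one-vs-rest) and pick the class with the highest combined score.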

Extras:

An exploration you could perform is to investigate which layers you should select (either for early or late fusion) prior to training your SVM. This [2] is an interesting paper, where the authors explore something similar (analyzing the performance when using the outputs of the last few layers as features, separately). As far as I remember, their investigation is in the context of transfer learning (using a model pre-trained on a similar problem to tackle another task).

[1] "Early versus Late Fusion in Semantic Video Analysis"

[2] "CNN Features off-the-shelf: an Astounding Baseline for Recognition"