
I want to test the performance of each convolutional layer of my Convolutional Neural Network (CNN) architecture by training an SVM on its features. I am using the MatConvNet MATLAB toolbox.

The layers are as follows: Conv1 Relu1 Pool1 (3x3, 32 features) -> Conv2 Relu2 Pool2 (3x3, 64 features) -> Conv3 Relu3 Pool3 (3x3, 128 features) -> Conv4 Relu4 (1x1, 256 features) -> Conv5 (1x1, 4 features) -> Softmaxloss

After training, I removed the loss layer:

net.layers=net.layers(1 : end - 1);

The resulting network looks like this: (network screenshot omitted)

I can extract the features like this:

feats = vl_simplenn(net, im);
Feature_L1(fea, :) = squeeze(feats(end).x);

Similarly, I removed two more layers and extracted 256 features from Conv4. But when I moved to Conv3, the output feature map is 7x7x128. I want to know how I can use these features:
i) by making a single vector, or ii) by averaging the values in depth?

Make a single vector with 7x7x128 = 6272 dimensions. If you stumble upon another layer where the final vector would be too big, you might want to use some dimensionality reduction, such as PCA. – rafaspadilha

Does it matter whether each 7x7 map is turned into a single vector column-wise or row-wise? – Addee

By "if you stumble upon another layer", do you mean combining the features of two layers? – Addee

I see. You may want to increase the number of layers of your network as a way of increasing its learning capacity. Your network is fairly simple compared to what is being used nowadays. – rafaspadilha

Could you put these things in an answer so that I can accept it? – Addee

1 Answer


Transforming a map into a feature vector:

In your case, you could turn the 7x7x128 map into a vector with 7x7x128 = 6272 dimensions. If you have an intermediate map whose flattened vector has very high dimensionality (e.g., 100k-d), you may want to reduce its dimensions (through PCA, for example), because it probably contains a lot of redundancy in its features.
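A minimal sketch of both steps, in Python/NumPy for illustration (the question uses MATLAB, where flattening is simply `feat(:)`); all sizes and data here are hypothetical stand-ins:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical Conv3 output: one 7x7x128 map (random stand-in values).
feat_map = rng.standard_normal((7, 7, 128))
vec = feat_map.reshape(-1)            # flattened: 7*7*128 = 6272 dimensions

# For a very high-dimensional layer, reduce a whole training matrix
# (one flattened vector per image) with PCA before training the SVM.
X = rng.standard_normal((200, 6272))  # 200 hypothetical training images
X_reduced = PCA(n_components=100).fit_transform(X)

print(vec.shape, X_reduced.shape)
```

PCA is fit on the stacked training vectors (one row per image), not on a single map, so the same projection can be applied to test images.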

Combining the output of your layers:

As for combining the features, there are many ways of doing it. One option, which you've mentioned, is to create a single vector by concatenating the desired layers' outputs and train a classifier on top of that. This is known as early fusion [1]: you combine your features into a single representation before training.
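Early fusion can be sketched like this (Python/scikit-learn for illustration; the feature matrices and sizes are hypothetical stand-ins for what you would extract with vl_simplenn):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 100  # hypothetical number of training images

# Stand-in features from two layers (random values for illustration):
conv4_feats = rng.standard_normal((n, 256))    # Conv4 output, 256-d
conv3_feats = rng.standard_normal((n, 6272))   # flattened Conv3 output
labels = rng.integers(0, 4, size=n)            # 4 classes, as in the network

# Early fusion: concatenate the layers' outputs into one vector per image,
# then train a single classifier on the combined representation.
fused = np.hstack([conv4_feats, conv3_feats])  # 256 + 6272 = 6528 dims
clf = LinearSVC(max_iter=5000).fit(fused, labels)

print(fused.shape)
```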

Another possibility is to train a separate classifier for each feature (the output of each intermediate layer, in your case) and then, for a testing image, you combine the outputs/scores of those separate classifiers. This is known as late fusion [1].
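A corresponding late-fusion sketch, again in Python with hypothetical stand-in features; here the per-classifier decision scores are simply summed, which is one of several common combination rules:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
n = 100
feats_a = rng.standard_normal((n, 256))   # e.g., Conv4 features
feats_b = rng.standard_normal((n, 128))   # e.g., spatially averaged Conv3
labels = rng.integers(0, 4, size=n)

# Late fusion: train one classifier per feature set...
clf_a = LinearSVC(max_iter=5000).fit(feats_a, labels)
clf_b = LinearSVC(max_iter=5000).fit(feats_b, labels)

# ...then combine their per-class scores at test time (here, a simple sum).
test_a = rng.standard_normal((10, 256))
test_b = rng.standard_normal((10, 128))
scores = clf_a.decision_function(test_a) + clf_b.decision_function(test_b)
predictions = scores.argmax(axis=1)       # fused prediction per test image

print(predictions.shape)
```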

Extras:

One exploration you could perform is to investigate which layers to select (either for early or late fusion) prior to training your SVM. This [2] is an interesting paper, where the authors explore something similar (analyzing the performance when using the output of each of the last few layers as features separately). As far as I remember, their investigation is within the context of transfer learning (using a model pre-trained on a similar problem to tackle another task).
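Such a layer comparison can be done with a simple cross-validation loop; this Python sketch uses random stand-in features (in practice, the matrices would come from your vl_simplenn extraction):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 120
labels = rng.integers(0, 4, size=n)

# Stand-in feature matrices for the layers under consideration:
layer_feats = {
    "conv3": rng.standard_normal((n, 6272)),  # flattened Conv3 maps
    "conv4": rng.standard_normal((n, 256)),   # Conv4 outputs
}

# Compare layers by cross-validated SVM accuracy before committing to one.
results = {}
for name, X in layer_feats.items():
    results[name] = cross_val_score(LinearSVC(max_iter=5000),
                                    X, labels, cv=3).mean()
    print(name, round(results[name], 3))
```

With real features, the layer with the highest cross-validated accuracy is a reasonable default choice before trying fusion.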

[1] "Early versus Late Fusion in Semantic Video Analysis"

[2] "CNN Features off-the-shelf: an Astounding Baseline for Recognition"