I have a small comprehension problem with CNNs: I'm not quite sure how many filters, and thus how many weights, are trained.

Example: I have an input layer of 32x32 pixels with 3 channels (i.e. shape (32,32,3)). Now I use a 2D-convolution layer with 10 filters of shape (4,4). So I end up with 10 output channels, each of shape (29,29), but do I train a separate filter for each input channel, or are the weights shared across channels? Do I train 3x10x4x4 weights or 10x4x4 weights?

1 Answer

You can find the number of trainable (and non-trainable) parameters of a Keras model using the summary method:

from keras import models, layers

# A single Conv2D layer: 10 filters of shape (4,4) on a (32,32,3) input
model = models.Sequential()
model.add(layers.Conv2D(10, (4,4), input_shape=(32, 32, 3)))

model.summary()

Here is the output:

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 29, 29, 10)        490       
=================================================================
Total params: 490
Trainable params: 490
Non-trainable params: 0

In general, for a 2D-convolution layer with k filters of size w*w applied to an input with c channels, the number of trainable parameters (with one bias per filter, as in the default case) is k*w*w*c + k, i.e. k*(w*w*c + 1). In the example above we have k=10, w=4, c=3, so there are 10*(4*4*3+1) = 490 trainable parameters. As you can infer, each input channel gets its own weights; they are not shared across channels. Furthermore, the number of parameters of a 2D-convolution layer does not depend on the width or height of the previous layer.
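
If you want to check the formula programmatically, here is a minimal sketch (my addition, not part of the original question) that compares it against Keras's own count via the count_params method:

import keras
from keras import models, layers

k, w, c = 10, 4, 3  # number of filters, kernel width, input channels

model = models.Sequential()
model.add(layers.Conv2D(k, (w, w), input_shape=(32, 32, c)))

# One (w x w x c) kernel plus one bias per filter
assert model.count_params() == k * (w * w * c + 1)  # 490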

Update:

A convolution layer with depth-wise shared weights: I am not aware of such a layer and could not find a built-in implementation in Keras or Tensorflow either. But if you think about it, such a layer is essentially equivalent to summing all the channels together and then applying a 2D-convolution on the result: since convolution is linear, applying the same kernel to each channel and summing the outputs gives the same result as applying that kernel once to the sum of the channels. For example, in the case of a 32*32*3 image, the three channels are first summed together, giving a 32*32*1 tensor, and a 2D-convolution is then applied to that tensor. So at least one way of achieving a 2D-convolution with weights shared across channels in Keras is the following (which may or may not be efficient):

from keras import models, layers
from keras import backend as K

model = models.Sequential()
# Sum over the channel axis, then restore the dropped channel dimension
model.add(layers.Lambda(lambda x: K.expand_dims(K.sum(x, axis=-1)), input_shape=(32, 32, 3)))
model.add(layers.Conv2D(10, (4,4)))

model.summary()

Output:

Layer (type)                 Output Shape              Param #   
=================================================================
lambda_1 (Lambda)            (None, 32, 32, 1)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 29, 29, 10)        170       
=================================================================
Total params: 170
Trainable params: 170
Non-trainable params: 0
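
To see concretely why the equivalence holds, here is a small sketch (my addition, using numpy and scipy rather than Keras) of the linearity argument: correlating each channel with the same kernel and summing gives the same result as correlating the channel-sum once. (Keras's Conv2D actually computes cross-correlation, so correlate2d is the right analogue.)

import numpy as np
from scipy.signal import correlate2d

x = np.random.randn(32, 32, 3)   # a random 3-channel "image"
kernel = np.random.randn(4, 4)   # one shared 4x4 kernel

# Apply the shared kernel to each channel, then sum the results
per_channel = sum(correlate2d(x[:, :, c], kernel, mode='valid')
                  for c in range(3))

# Sum the channels first, then apply the kernel once
channel_sum = correlate2d(x.sum(axis=-1), kernel, mode='valid')

assert np.allclose(per_channel, channel_sum)  # both have shape (29, 29)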

One good thing about that Lambda layer is that it can be added anywhere (e.g. after a convolution layer). But I think the most important question to ask here is: "Why would using a 2D-conv layer with depth-wise shared weights be beneficial?" One obvious answer is that the network size (i.e. the total number of trainable parameters) is reduced, which might decrease training time, although I suspect the effect would be negligible. Further, sharing weights across channels assumes that the patterns present in the different channels are more or less similar. That is not always the case, for example in RGB images, so with weights shared across channels you might observe a (noticeable) drop in accuracy. You should at least keep this trade-off in mind and experiment with it.

However, there is another kind of convolution layer you might be interested in, called "Depth-wise Separable Convolution", which has been implemented in Tensorflow, and Keras supports it as well. The idea is that a separate 2D-conv filter is applied to each channel, and the resulting feature maps are then aggregated using k 1*1 convolutions (where k is the number of output channels). It essentially separates the learning of spatial features from the learning of depth-wise (cross-channel) features. In his paper "Xception: Deep Learning with Depthwise Separable Convolutions", Francois Chollet (the creator of Keras) shows that using depth-wise separable convolutions improves both the performance and the accuracy of the network. And here you can read more about different kinds of convolution layers used in deep learning.
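
For reference, here is a minimal sketch of using it in Keras (SeparableConv2D is the built-in layer; the parameter breakdown below assumes the default depth_multiplier=1):

from keras import models, layers

model = models.Sequential()
model.add(layers.SeparableConv2D(10, (4,4), input_shape=(32, 32, 3)))

model.summary()

# Depth-wise step: 3 channels, each with its own 4*4 kernel = 48 weights
# Point-wise step: 10 filters of shape 1*1*3                = 30 weights
# Biases: one per output filter                             = 10
# Total: 88 parameters, versus 490 for the regular Conv2D above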