
I came across this nice article which gives an intuitive explanation of how convnets work.

Now I'm trying to understand what exactly goes on inside a Caffe conv layer:

With input data of shape 1 x 13 x 19 x 19 and a conv layer with 128 filters:

layers {
  name: "conv1_7x7_128"
  type: CONVOLUTION
  blobs_lr: 1.
  blobs_lr: 2.
  bottom: "data"
  top: "conv2"
  convolution_param {
    num_output: 128
    kernel_size: 7
    pad: 3
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}

The layer output shape is 1 x 128 x 19 x 19, if I understand correctly.

Looking at the layer's weights' shapes in net->layers()[1]->blobs():

layer  1: type Convolution  'conv1_7x7_128'
  blob 0: 128 13 7 7
  blob 1: 128
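
For reference, the same blob shapes can also be read from pycaffe - a minimal sketch, where the prototxt and caffemodel file names are placeholders for my actual files:

import caffe

# Load the network in test mode; the file names here are placeholders.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# net.params maps each layer with learnable parameters to its list of blobs.
for name, blobs in net.params.items():
    print(name, [b.data.shape for b in blobs])
# conv1_7x7_128 [(128, 13, 7, 7), (128,)]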

Looks like blob 0 has all the weights: one 7x7 matrix per input plane (13) per filter (128).

Doing convolutions with blob 0 on the 1 x 13 x 19 x 19 data, if I understand correctly, we end up with a 128 x 13 x 19 x 19 output (there is padding, so each 7x7 matrix produces one number for each pixel).

  • How does 128 x 13 x 19 x 19 turn into the layer's 1 x 128 x 19 x 19 output?

  • What are the 128 weights in blob 1?

Bonus question: what is blobs_lr?

A simple Google search about the bonus question led me to the MNIST tutorial, which explains why people use two different learning rates: blobs_lr are the learning rate adjustments for the layer's learnable parameters. In this case, we will set the weight learning rate to be the same as the learning rate given by the solver during runtime, and the bias learning rate to be twice as large as that - this usually leads to better convergence rates. github.com/BVLC/caffe/issues/913 - Eliethesaiyan

1 Answer


You are quoting an older version of Caffe's prototxt format. Adjusting to the new format gives you:

layer {  # layer, not layers
  name: "conv1_7x7_128"
  type: "Convolution"  # type is now a string
  param { lr_mult: 1. }  # instead of blobs_lr
  param { lr_mult: 2. }  # instead of blobs_lr
  bottom: "data"
  top: "conv2"
  convolution_param {
    num_output: 128
    kernel_size: 7
    pad: 3
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}

If you have input data of shape 1 x 13 x 19 x 19, that means your batch_size is 1 and you have 13 channels with spatial dimensions of 19 x 19.
Applying 128 filters of 7 x 7 (each filter is applied to all 13 input channels) means you have 128 filters of shape 13 x 7 x 7 (this is the shape of your layer's first parameter blob). Each filter produces a single output channel: at every spatial location the 13 per-channel 7 x 7 responses are summed into one number, so the 13 input channels collapse into a 1 x 1 x 19 x 19 output. Since you have 128 such filters, you end up with a 1 x 128 x 19 x 19 output. (With kernel_size: 7, pad: 3 and the default stride of 1, the spatial size is preserved: (19 + 2*3 - 7) + 1 = 19.)
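
Here is a minimal numpy sketch of that bookkeeping (plain nested loops over random data, stride 1 and pad 3; Caffe actually uses an im2col/GEMM implementation, so this only illustrates the shapes, not the real code path):

import numpy as np

batch, in_ch, H, W = 1, 13, 19, 19
num_out, k, pad = 128, 7, 3

x = np.random.randn(batch, in_ch, H, W)          # input data
weights = np.random.randn(num_out, in_ch, k, k)  # blob 0: 128 x 13 x 7 x 7

# Zero-pad the spatial dims so a 7x7 window fits around every pixel.
xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))

out = np.zeros((batch, num_out, H, W))
for f in range(num_out):                 # one output channel per filter
    for i in range(H):
        for j in range(W):
            window = xp[0, :, i:i + k, j:j + k]  # 13 x 7 x 7
            # summing over the 13 input channels collapses them to one number
            out[0, f, i, j] = np.sum(window * weights[f])

print(out.shape)  # (1, 128, 19, 19)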

The layer's second parameter blob is the bias term: one additive scalar per filter (hence its shape of 128), added to every pixel of that filter's output channel. You can turn off the bias term by adding

bias_term: false

to the convolution_param of your layer.
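
If the bias is enabled, each of the 128 scalars in blob 1 is simply broadcast over the corresponding output channel - a small numpy sketch of that addition:

import numpy as np

num_out, H, W = 128, 19, 19
conv_out = np.zeros((1, num_out, H, W))  # convolution result, as above
bias = np.random.randn(num_out)          # blob 1: one scalar per filter

# Broadcast each filter's bias over its 19 x 19 output channel.
out = conv_out + bias.reshape(1, num_out, 1, 1)
print(out.shape)  # (1, 128, 19, 19)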

You can read more about the convolution layer here.

As for the bonus question, Eliethesaiyan already answered it well in his comment.