
I am learning about convolutional neural networks and trying to figure out how the mathematical computation takes place. Suppose an input image has 3 channels (RGB), so its shape is 28*28*3. Say 6 filters of size 5*5*3 with stride 1 are applied for the next layer. Thus, we get a 24*24*6 output in the next layer. Since the input image is an RGB image, how is each filter's 24*24 output interpreted as an RGB image, i.e., does each filter internally construct an image of size 24*24*3?

1 Answer


After you've applied the first convolutional layer, you can't think of it as being RGB anymore. That [5, 5, 3] convolution takes all of the information from 5*5*3 = 75 floats (25 pixels, each with 3 channels) and mixes it together based upon whatever parameters the network has trained for that filter.
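A minimal NumPy sketch may make this concrete (random values stand in for a real image and trained filters): each output position sums all 75 input floats against one filter, so every filter produces a single 24*24 map, not an RGB image, and the six maps stack into 24*24*6.

```python
import numpy as np

# Sketch of a "valid" convolution with stride 1, matching the shapes
# discussed above: a 28x28x3 input and six 5x5x3 filters.
rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28, 3))     # H x W x C input
filters = rng.standard_normal((6, 5, 5, 3))  # 6 filters, each 5x5x3

out = np.zeros((24, 24, 6))                  # (28-5+1) x (28-5+1) x 6
for f in range(6):
    for i in range(24):
        for j in range(24):
            # Each output value mixes all 5*5*3 = 75 input floats into a
            # single number -- the RGB channel structure is gone here.
            patch = image[i:i+5, j:j+5, :]
            out[i, j, f] = np.sum(patch * filters[f])

print(out.shape)  # (24, 24, 6)
```

Real frameworks do the same computation vectorized (and usually add a bias per filter), but the key point is that the third output dimension indexes filters, not colors.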

In many image recognition tasks, the first layer often learns edge detectors, sharpening masks, and similar low-level features. For example, see this visualization of the layers of VGG16.
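To illustrate what an "edge detector" filter does, here is a hand-written Sobel-style kernel applied to a toy image; this specific kernel is an assumption for illustration, not a filter taken from VGG16.

```python
import numpy as np

# A classic horizontal-edge (Sobel-like) kernel, of the kind first conv
# layers often end up resembling after training.
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)

# Toy grayscale image: top half dark, bottom half bright -> one edge.
img = np.zeros((8, 8))
img[4:, :] = 1.0

edges = np.zeros((6, 6))  # valid convolution: (8-3+1) x (8-3+1)
for i in range(6):
    for j in range(6):
        edges[i, j] = np.sum(img[i:i+3, j:j+3] * sobel_y)

# Responses are strong only on rows where brightness changes,
# and zero in the flat regions.
print(edges[2, 2], edges[0, 0])  # 4.0 0.0
```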

But the output itself is just... information, at that point. Or, to be more precise, the meaning of the depth channels is going to depend on how the network has learned. There will probably be meaningful things that differentiate the depth channels (and what the different values in there mean), but it's unlikely to be intuitive without trying to visualize it. I don't know of a project that's visualized the depth channels independently, but someone might have.
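If you did want to look at the depth channels independently, a simple starting point is to treat each slice as its own grayscale map. A minimal sketch (using random values as a stand-in for a real layer's output):

```python
import numpy as np

# Each out[:, :, k] slice is just a 24x24 map of activations, not an
# RGB image; normalizing it to [0, 1] makes it plottable as grayscale
# (e.g. with matplotlib's plt.imshow).
rng = np.random.default_rng(1)
out = rng.standard_normal((24, 24, 6))  # stand-in for conv layer output

for k in range(out.shape[-1]):
    channel = out[:, :, k]              # one feature map
    norm = (channel - channel.min()) / (channel.max() - channel.min())
    print(k, norm.shape, float(norm.min()), float(norm.max()))
```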