0
votes

I have studied ordinary fully connected ANNs, I am starting to study convnets. I am struggling to understand how hidden layers connect. I do understand how the input matrix forward feeds a smaller field of values to the feature maps in the first hidden layer, by moving the local receptive field along one each time and forward feeding through the same/shared weights (for each feature map), so there are only one group of weights per feature map that are of the same structure as the local receptive field. Please correct me if I am wrong. Then, the feature maps use pooling to simplify the maps. The next part is when I get confused, here is a link to a 3d CNN visualisation to help explain my confusion

http://scs.ryerson.ca/~aharley/vis/conv/

Draw a digit between 0-9 into the top left pad and you'll see how it works. Its really cool. So, on the layer after the first pooling layer (the 4th row up containing 16 filters) if yoau hover your mouse over the filters you can see how the weights connect to the previous pooling layer. Try different filters on this row and what I do not understand is the rule that connects the second convolution layer to the previous pool layer. E.g on the filters to the very left, they are fully connected to the pooling layer. But on the ones nearer to the right, they only connect to about 3 of the previous pooled layers. Looks random.

I hope my explanation makes sense. I am essentially confused about what the pattern is that connects hidden pooled layers to the following hidden convolution layer. Even if my example is a bit odd, I would still appreciate some sort of explanation or link to a good explanation.

Thanks a lot.

1
stats.stackexchange.com may be better suited for this question - Sentry

1 Answers

1
votes

Welcome to the magic of self-trained CNNs. It's confusing because the network makes up these rules as it trains. This is an image-processing example; most of these happen to train in a fashion that loosely parallels the learning in a simplified model of the visual cortex in vertebrates.

In general, the first layer's kernels "learn" to "recognize" very simple features of the input: lines and edges in various orientations. The next layer combines those for more complex features, perhaps a left-facing half-circle, or a particular angle orientation. The deeper you go in the model, the more complex the "decisions" get, and the kernels get more complex, and/or less recognizable.

The difference in connectivity from left to right may be an intentional sorting by the developer, or mere circumstance in the model. Some features need to "consult" only a handful of the previous layer's kernels; others need a committee of the whole. Note how simple features connect to relatively few kernels, while the final decision has each of the ten categories checking in with a large proportion of the "pixel"-level units in the last FC layer.

You might look around for some kernel visualizations for larger CNN implementations, such as those in the ILSVRC: GoogleNet, ResNet, VGG, etc. Those have some striking kernels through the layers, including fuzzy matches to a wheel & fender, the front part of a standing mammal, various types of faces, etc.

Does that help at all?

All of this is the result of organic growth over the training period.