[This question is now also posed at Cross Validated]
The question in short
I'm studying convolutional neural networks, and I believe that these networks do not treat every input neuron (pixel/parameter) equivalently. Imagine we have a deep network (many layers) that applies convolution on some input image. The neurons in the "middle" of the image have many unique pathways to many deeper layer neurons, which means that a small variation in the middle neurons has a strong effect on the output. However, the neurons at the edge of the image have only 1 way (or, depending on the exact implementation, of the order of 1) pathways in which their information flows through the graph. It seems that these are "under-represented".
I am concerned about this, as this discrimination of edge neurons scales exponentially with the depth (number of layers) of the network. Even adding a max-pooling layer won't halt the exponential increase, only a full connection brings all neurons on equal footing. I'm not convinced that my reasoning is correct, though, so my questions are:
- Am I right that this effect takes place in deep convolutional networks?
- Is there any theory about this, has it ever been mentioned in literature?
- Are there ways to overcome this effect?
Because I'm not sure if this gives sufficient information, I'll elaborate a bit more about the problem statement, and why I believe this is a concern.
More detailed explanation
Imagine we have a deep neural network that takes an image as input. Assume we apply a convolutional filter of 64x64 pixel over the image, where we shift the convolution window by 4 pixels each time. This means that every neuron in the input sends it's activation to 16x16 = 265 neurons in layer 2. Each of these neurons might send their activation to another 265, such that our topmost neuron is represented in 265^2 output neurons, and so on. This is, however, not true for neurons on the edges: these might be represented in only a small number of convolution windows, thus causing them to activate (of the order of) only 1 neuron in the next layer. Using tricks such as mirroring along the edges won't help this: the second-layer-neurons that will be projected to are still at the edges, which means that that the second-layer-neurons will be underrepresented (thus limiting the importance of our edge neurons as well). As can be seen, this discrepancy scales exponentially with the number of layers.
I have created an image to visualize the problem, which can be found here (I'm not allowed to include images in the post itself). This network has a convolution window of size 3. The numbers next to neurons indicate the number of pathways down to the deepest neuron. The image is reminiscent of Pascal's Triangle.
https://www.dropbox.com/s/7rbwv7z14j4h0jr/deep_conv_problem_stackxchange.png?dl=0
Why is this a problem?
This effect doesn't seem to be a problem at first sight: In principle, the weights should automatically adjust in such a way that the network does it's job. Moreover, the edges of an image are not that important anyway in image recognition. This effect might not be noticeable in everyday image recognition tests, but it still concerns me because of two reasons: 1) generalization to other applications, and 2) problems arising in the case of very deep networks. 1) There might be other applications, like speech or sound recognition, where it is not true that the middle-most neurons are the most important. Applying convolution is often done in this field, but I haven't been able to find any papers that mention the effect that I'm concerned with. 2) Very deep networks will notice an exponentially bad effect of the discrimination of boundary neurons, which means that central neurons can be overrepresented by multiple order of magnitude (imagine we have 10 layers such that the above example would give 265^10 ways the central neurons can project their information). As one increases the number of layers, one is bound to hit a limit where weights cannot feasibly compensate for this effect. Now imagine we perturb all neurons by a small amount. The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?