
[This question is now also posed at Cross Validated]

The question in short

I'm studying convolutional neural networks, and I believe that these networks do not treat every input neuron (pixel/parameter) equivalently. Imagine we have a deep network (many layers) that applies convolution on some input image. The neurons in the "middle" of the image have many unique pathways to many deeper-layer neurons, which means that a small variation in the middle neurons has a strong effect on the output. However, the neurons at the edge of the image have only one pathway (or, depending on the exact implementation, on the order of one) through which their information flows through the graph. It seems that these are "under-represented".

I am concerned about this, as this discrimination of edge neurons scales exponentially with the depth (number of layers) of the network. Even adding a max-pooling layer won't halt the exponential increase; only a fully connected layer puts all neurons on an equal footing. I'm not convinced that my reasoning is correct, though, so my questions are:

  • Am I right that this effect takes place in deep convolutional networks?
  • Is there any theory about this, has it ever been mentioned in literature?
  • Are there ways to overcome this effect?

Because I'm not sure whether this gives sufficient information, I'll elaborate a bit more on the problem statement and why I believe this is a concern.

More detailed explanation

Imagine we have a deep neural network that takes an image as input. Assume we apply a convolutional filter of 64x64 pixels over the image, where we shift the convolution window by 4 pixels each time. This means that every neuron in the input sends its activation to 16x16 = 256 neurons in layer 2. Each of these neurons might send its activation to another 256, such that our topmost neuron is represented in 256^2 output neurons, and so on. This is, however, not true for neurons on the edges: these might be represented in only a small number of convolution windows, thus causing them to activate (on the order of) only 1 neuron in the next layer. Using tricks such as mirroring along the edges won't help: the second-layer neurons that they project to are still at the edges, which means that the second-layer neurons will be under-represented (thus limiting the importance of our edge neurons as well). As can be seen, this discrepancy scales exponentially with the number of layers.
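
To make the counting concrete, here is a minimal sketch of that arithmetic (a 1-D analogue, assuming a 200-pixel-wide input for concreteness, with the 64-pixel filter and stride of 4 from the example above; the helper name window_coverage is only for illustration). It counts how many convolution windows cover each input position:

    # Count how many stride-4, size-64 convolution windows cover each position
    # of a 1-D input of width 200 (a 1-D analogue of the example above).
    def window_coverage(width=200, kernel=64, stride=4):
        n_windows = (width - kernel) // stride + 1
        counts = [0] * width
        for j in range(n_windows):                        # slide the window
            for p in range(j * stride, j * stride + kernel):
                counts[p] += 1                            # position p lies in window j
        return counts

    coverage = window_coverage()
    print(coverage[0], coverage[1], coverage[100], coverage[-1])
    # prints: 1 1 16 1 -- edge positions fall in a single window,
    #                     central positions in kernel/stride = 16 windows

In two dimensions the counts multiply, which gives the 16x16 = 256 windows for a central pixel versus roughly one window for a corner pixel.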

I have created an image to visualize the problem, which can be found here (I'm not allowed to include images in the post itself). This network has a convolution window of size 3. The numbers next to neurons indicate the number of pathways down to the deepest neuron. The image is reminiscent of Pascal's Triangle.

https://www.dropbox.com/s/7rbwv7z14j4h0jr/deep_conv_problem_stackxchange.png?dl=0
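
The Pascal's-Triangle-like numbers in that image can be reproduced with a small sketch (assuming a 1-D stack of convolutions with window size 3 and stride 1, counting pathways from each neuron down to the single deepest neuron; the function name pathway_counts is made up for illustration):

    # Number of distinct pathways from each neuron, n_layers above the deepest
    # neuron, down to that deepest neuron (1-D convolutions, window 3, stride 1).
    def pathway_counts(n_layers, window=3):
        counts = [1]                          # start at the single deepest neuron
        for _ in range(n_layers):             # walk back up one conv layer at a time
            new_counts = [0] * (len(counts) + window - 1)
            for i, c in enumerate(counts):
                for k in range(window):       # each neuron is fed by `window` neurons above
                    new_counts[i + k] += c
            counts = new_counts
        return counts

    for layer in range(4):
        print(layer, pathway_counts(layer))
    # 0 [1]
    # 1 [1, 1, 1]
    # 2 [1, 2, 3, 2, 1]
    # 3 [1, 3, 6, 7, 6, 3, 1]

The edge entries stay at 1, while the central entry grows roughly exponentially with the number of layers, which is exactly the discrepancy described above.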

Why is this a problem?

This effect doesn't seem to be a problem at first sight: in principle, the weights should automatically adjust in such a way that the network does its job. Moreover, the edges of an image are not that important anyway in image recognition. This effect might not be noticeable in everyday image recognition tests, but it still concerns me for two reasons: 1) generalization to other applications, and 2) problems arising in the case of very deep networks.

1) There might be other applications, like speech or sound recognition, where it is not true that the middle-most neurons are the most important. Applying convolution is often done in this field, but I haven't been able to find any papers that mention the effect that I'm concerned with.

2) Very deep networks will notice an exponentially bad effect of the discrimination of boundary neurons, which means that central neurons can be over-represented by multiple orders of magnitude (imagine we have 10 layers, such that the above example would give 256^10 ways the central neurons can project their information). As one increases the number of layers, one is bound to hit a limit where the weights cannot feasibly compensate for this effect. Now imagine we perturb all neurons by a small amount. The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?

  • This isn't really a programming question and so might be better suited for Cross Validated. (See also this meta post for some discussion.) – lmjohns3
  • Thanks for the tip! I decided to wait for a week-ish and just posed the question on Cross Validated. – Koen

1 Answer


I will quote your sentences and write my answers below each.

  • Am I right that this effect takes place in deep convolutional networks?

    • I think you are wrong in general, but right for your 64x64 convolution filter example. When you structure your convolution layer filter sizes, they would never be bigger than what you are looking for in your images. In other words, if your images are 200x200 and you convolve with 64x64 patches, you are saying that these 64x64 patches will learn some part of, or exactly, the image patch that identifies your category. The idea in the first layer is to learn edge-like, partially important features, not the entire cat or car itself.
  • Is there any theory about this, has it ever been mentioned in the literature? And: Are there ways to overcome this effect?

    • I have never seen it in any paper I have looked through so far, and I do not think this would be an issue even for very deep networks.

    • There is no such effect. Suppose your first layer, which learned 64x64 patches, is in action. If there is a patch in the top-left-most corner that gets fired (becomes active), then it will show up as a 1 in the top-left corner of the next layer, and hence the information will be propagated through the network.

  • (not quoted) You should not think of it as 'a pixel is useful to more neurons as it gets closer to the center'. Think about a 64x64 filter with a stride of 4:

    • if the pattern that your 64x64 filter looks for is in the top-left-most corner of the image, then it will get propagated to the top-left corner of the next layer; otherwise there will be nothing in the next layer.

    • the idea is to keep the meaningful parts of the image alive while suppressing the non-meaningful, dull parts, and to combine these meaningful parts in the following layers. For the case of learning an uppercase letter 'A', please look only at the images in the very old paper by Fukushima (1980) (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf), figures 5 and 7. Hence there is no importance of a single pixel; there is importance of an image patch the size of your convolution filter.

  • The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?

    • Suppose you are looking for a car in an image,

    • And suppose that in your first example the car is definitely in the 64x64 top-left-most part of your 200x200 image, and in your second example the car is definitely in the 64x64 bottom-right-most part of your 200x200 image.

    • In the second layer, almost all your activation values will be 0: for the first image, all except the one in the very top-left-most corner, and for the second image, all except the one in the very bottom-right-most corner.

    • Now, the center part of the image will mean nothing to my forward and backward propagation because the values will already be 0. But the corner values will never be discarded and will affect my learning weights (see the sketch below).
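
To illustrate that last point, here is a minimal sketch (assuming numpy and made-up data; the "car" template and the exact-match filter are only for illustration): a pattern confined to the top-left 64x64 patch of a 200x200 image produces its strong response only at the top-left corner of the next layer's feature map, yet that response is not discarded.

    # A pattern in the top-left 64x64 patch of a 200x200 image responds only at
    # the top-left corner of the next layer (64x64 filter, stride 4, no padding).
    import numpy as np

    rng = np.random.default_rng(0)
    pattern = rng.random((64, 64))            # a made-up "car" template
    image = np.zeros((200, 200))
    image[:64, :64] = pattern                 # the car sits in the top-left corner

    stride = 4
    positions = range(0, 200 - 64 + 1, stride)
    out = np.array([[np.sum(image[i:i + 64, j:j + 64] * pattern)   # filter = the template
                     for j in positions]
                    for i in positions])

    print(out.shape)                                   # (35, 35)
    print(np.unravel_index(out.argmax(), out.shape))   # (0, 0): strongest response in the corner

The corner location is covered by only a single window, but its activation is still present in the next layer and still contributes to the gradients, which is the point being made above.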