7
votes

I don't understand why there is the need to flip filters when using convolutional neural networks.

According to the lasagne documentation,

flip_filters : bool (default: True)

Whether to flip the filters before sliding them over the input, performing a convolution (this is the default), or not to flip them and perform a correlation. Note that for some other convolutional layers in Lasagne, flipping incurs an overhead and is disabled by default – check the documentation when using learned weights from another layer.

What does that mean? I never read about flipping filters when convolving in any neural network book. Would someone clarify, please?

2
It's some sort of correction for different layer types, see github.com/Lasagne/Recipes/issues/39 - frozen

2 Answers

13
votes

The underlying reason for flipping a convolutional filter is the definition of the convolution operation, which comes from signal processing. When performing a convolution, the kernel is flipped with respect to the axis along which you're convolving; if you don't flip it, you end up computing a cross-correlation of the signal with the kernel instead of a convolution. It's a bit easier to understand if you think about applying a 1D convolution to a time series in which the function in question changes very sharply: you don't want your output to be skewed by, or correlated with, your signal.
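To make the 1D case concrete, here is a minimal NumPy sketch (the signal and kernel values are made up for illustration). It feeds a single impulse through both operations: convolution reproduces the kernel as written, while correlation reproduces it reversed, and the two agree once the kernel is flipped by hand.

```python
import numpy as np

# Illustrative values only: a unit impulse and an asymmetric kernel.
signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
kernel = np.array([1.0, 2.0, 3.0])

# Convolution flips the kernel internally, so the impulse "stamps out"
# the kernel in its original order.
conv = np.convolve(signal, kernel, mode="full")

# Correlation slides the kernel unflipped, so the impulse stamps out
# the kernel in reverse order.
corr = np.correlate(signal, kernel, mode="full")

# Flipping the kernel by hand turns correlation into convolution.
corr_flipped = np.correlate(signal, kernel[::-1], mode="full")

print(conv)          # kernel appears as 1, 2, 3
print(corr)          # kernel appears as 3, 2, 1
print(corr_flipped)  # same as conv
```

With a symmetric kernel the two operations would give the same result, which is why the distinction only shows up for asymmetric filters.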

This answer from the digital signal processing stack exchange site gives an excellent explanation that walks through the mathematics of why convolutional filters are defined to go in the reverse direction of the signal.

This page walks through a detailed example where the flip is done. This is a particular type of filter used for edge detection called a Sobel filter. It doesn't explain why the flip is done, but is nice because it gives you a worked-out example in 2D.
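As a rough 2D illustration of my own (not taken from the linked page), the sketch below applies a horizontal Sobel kernel to a tiny synthetic image with a vertical step edge. The helper `slide_valid` is a hypothetical name for a plain sliding-window sum with no flip. For this particular kernel, flipping along both axes gives the negated kernel, so convolution and correlation produce edge responses of opposite sign.

```python
import numpy as np

def slide_valid(image, kernel):
    """Slide the kernel over the image WITHOUT flipping (cross-correlation)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# Flip along both spatial axes, as a true 2D convolution would.
sobel_x_flipped = sobel_x[::-1, ::-1]

# Synthetic image: dark on the left, bright on the right.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

correlation = slide_valid(image, sobel_x)          # no flip
convolution = slide_valid(image, sobel_x_flipped)  # flip, then slide

print(correlation[0])  # positive response at the edge
print(convolution[0])  # same magnitude, opposite sign
```

So for a hand-designed filter like this one, the choice of convention changes the sign of the detected edge.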

I mentioned that it is a bit easier to understand the why (as in, why is convolution defined this way) in the 1D case (the answer from the DSP SE site is really a great explanation); but this convention does apply to 2D and 3D as well (the Conv2DDNNLayer and Conv3DDNNLayer layers both have the flip_filters option). Ultimately, however, because the convolutional filter weights are not something that a human programs, but rather are "learned" by the network, the choice is entirely arbitrary, unless you are loading weights from another network, in which case you must be consistent with the definition of convolution in that network. If convolution was defined correctly (i.e., according to convention), the filter will be flipped. If it was defined incorrectly (in the more "naive" and "lazy" way), it will not.
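The weight-loading caveat can be sketched in plain NumPy. This is not Lasagne code: `correlate2d_valid` and `convolve2d_valid` are hypothetical helpers standing in for a layer that does not flip and a layer that does, and the random weights stand in for a filter learned elsewhere. The point is only that weights learned under one convention must be flipped along both spatial axes before being used under the other.

```python
import numpy as np

def correlate2d_valid(x, w):
    """Stand-in for a layer with no flip: kernel indexed forwards."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += x[i + a, j + b] * w[a, b]
    return out

def convolve2d_valid(x, w):
    """Stand-in for a layer that flips: kernel indexed backwards."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += x[i + a, j + b] * w[kh - 1 - a, kw - 1 - b]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))   # made-up input
w = rng.standard_normal((3, 3))   # pretend these were learned with correlation

reference = correlate2d_valid(x, w)            # what the source network computed
naive_load = convolve2d_valid(x, w)            # weights copied as-is: mismatch
fixed_load = convolve2d_valid(x, w[::-1, ::-1])  # weights flipped first: match

print(np.allclose(naive_load, reference))
print(np.allclose(fixed_load, reference))
```

In a real framework you would either flip the weight array like this or toggle the layer's flip option, whichever the library provides.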

The broader field that convolutions are a part of is "linear systems theory" so searching for this term might turn up more about this, albeit outside the context of neural networks.

Note that the convolution/correlation distinction is also mentioned in the docstrings of the corrmm.py module in lasagne:

flip_filters : bool (default: False)

Whether to flip the filters and perform a convolution, or not to flip them and perform a correlation. Flipping adds a bit of overhead, so it is disabled by default. In most cases this does not make a difference anyway because the filters are learnt. However, flip_filters should be set to True if weights are loaded into it that were learnt using a regular :class:`lasagne.layers.Conv2DLayer`, for example.

1
votes

Firstly, since CNN filters are learned from scratch rather than designed by hand, if the flip operation were necessary, the network would simply learn the flipped filters, so implementing cross-correlation gives an equivalent result. Secondly, flipping is necessary in 1D time-series processing, since past inputs affect the current system output given the "current" input. But in 2D/3D spatial convolution over images there is no concept of "time", hence no "past" input and no impact of it on "now". Therefore we don't need to consider the relationship between "signal" and "system"; there is only the relationship between "signal" (image patch) and "signal" (image patch), which means we only need cross-correlation instead of convolution (although deep learning borrows the concept from signal processing). Therefore, the flip operation is actually not needed. (I guess.)