I have been trying (and miserably failing) to understand how convolutions on images (with height, width, channels) are implemented in software.
I've heard people say their convolution implementation is done using GEMM, or done using "Direct convolution" or done using 1x1 kernels.
I find this very confusing, and I can't wrap my head around the many different ways it is described. I thought I understood a typical convolution, like PyTorch's conv2d, as a mathematical operation on an image. But what does someone mean when they say they do conv2d in one of the following ways?
- 1x1 kernels or 1x1 convolution (what does "kernel" even mean here?)
- GEMM
- "direct convolution"
For convolution using GEMM, what I understand from this paper is that the input image and the filters are each converted to 2D matrices using im2col and im2row ops, and then these two matrices are simply multiplied.
The 3D input image (height, width, input-channels) is converted to a 2D matrix, and the 4D kernel (output-channels, input-channels, kernel-height, kernel-width) is also converted to a 2D matrix. Or does "GEMM-based implementation of convolution" mean something else? If that is what it means, then how is it different from "convolution using 1x1 kernels"?
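To make the im2col description above concrete, here is a minimal NumPy sketch of what I think "direct convolution" and "im2col + GEMM" could look like. It is only my reading, not how any particular library does it. The function names (`conv2d_direct`, `im2col`, `conv2d_gemm`) are made up for illustration, and I am assuming a channels-first image (input-channels, height, width), stride 1, no padding, and no kernel flip (cross-correlation, which is what PyTorch's conv2d computes).

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct convolution: slide the kernel over the image with plain loops.
    x: (C_in, H, W), w: (C_out, C_in, KH, KW). Assumes stride 1, no padding."""
    c_in, h, wd = x.shape
    c_out, _, kh, kw = w.shape
    oh, ow = h - kh + 1, wd - kw + 1
    out = np.zeros((c_out, oh, ow))
    for o in range(c_out):
        for i in range(oh):
            for j in range(ow):
                # Multiply one input patch with one filter and sum everything.
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w[o])
    return out

def im2col(x, kh, kw):
    """Copy every (C_in, KH, KW) patch of x into one column of a 2D matrix.
    Result shape: (C_in * KH * KW, OH * OW)."""
    c_in, h, wd = x.shape
    oh, ow = h - kh + 1, wd - kw + 1
    cols = np.empty((c_in * kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + kh, j:j + kw].reshape(-1)
    return cols

def conv2d_gemm(x, w):
    """Same result as conv2d_direct, computed as a single matrix multiply."""
    c_out, c_in, kh, kw = w.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    cols = im2col(x, kh, kw)       # (C_in*KH*KW, OH*OW)
    w_mat = w.reshape(c_out, -1)   # (C_out, C_in*KH*KW)
    return (w_mat @ cols).reshape(c_out, oh, ow)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))     # 3 input channels, 5x5 image
w = rng.standard_normal((4, 3, 3, 3))  # 4 output channels, 3x3 kernels
assert np.allclose(conv2d_direct(x, w), conv2d_gemm(x, w))
```

In this sketch, when the kernel is 1x1 the `im2col` step degenerates into a plain reshape of the image to (C_in, H * W), so no patches are copied. I am not sure whether that is the distinction people have in mind when they talk about 1x1 kernels.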