What does it mean to say convolution implementation is based on GEMM (matrix multiply) or it is based on 1x1 kernels?

Question

I have been trying to understand (but miserably failing) how convolutions on images (with height, width, channels) are implemented in software.

I've heard people say their convolution implementation is done using GEMM, or done using "Direct convolution" or done using 1x1 kernels.

I find it very confusing and can't wrap my head around so many different ways it's described everywhere - I thought I understood a typical convolution like pytorch conv2d as a mathematical operation on an image, but what do they mean when someone says they do conv2d using one of the following ways?

1x1 kernels or 1x1 convolution (what does kernel even mean here)
GEMM
"direct convolution"

For doing convolution using GEMM, what I understand based on this paper is that the input image and the filters are each converted to 2D matrices using im2col and im2row ops, and then these two matrices are simply multiplied.

The 3d input image (height, width, input-channels) is converted to a 2d matrix, the 4-d kernel (output-channels, input-channels, kernel-height, kernel-width) is converted to a 2d matrix. Or does "GEMM-based implementation of convolution" mean something else? If that's what it means then how is it different than doing "convolution using 1x1 kernels"?

Szymon Maszke · Accepted Answer · 2020-10-24T11:46:49

1x1 kernels or 1x1 convolution (what does kernel even mean here)

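The kernel is the window of weights that slides over the image. You can have a 3x3 convolution, where a square of 9 elements slides over the image (with some specified stride, dilation, etc.). In a 1x1 convolution the kernel covers a single pixel (typically with stride=1 and no dilation).

So instead of a sliding window that sums over neighbouring pixels, you simply project each pixel linearly across its channels: with C_in input channels and C_out output channels, the 1x1 kernel is a C_out x C_in weight matrix applied independently at every pixel.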

It is a cheap operation. It is used as the pointwise step of depthwise separable convolutions, and more generally in many modern architectures to increase or decrease the number of channels.
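To make the "linear projection of each pixel" concrete, here is a minimal NumPy sketch. The sizes and variable names are made up for illustration, and it assumes stride 1, no bias and a single image; it is not how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
c_in, c_out, height, width = 3, 4, 5, 6
image = rng.standard_normal((c_in, height, width))   # (C_in, H, W)
weight = rng.standard_normal((c_out, c_in, 1, 1))    # (C_out, C_in, 1, 1)

# 1x1 convolution written out: each output pixel mixes the input channels
# of the same pixel; no neighbouring pixels are involved.
out_conv = np.einsum("oc,chw->ohw", weight[:, :, 0, 0], image)

# The same computation as one matrix multiply over flattened pixels.
w_mat = weight.reshape(c_out, c_in)                  # (C_out, C_in)
x_mat = image.reshape(c_in, height * width)          # (C_in, H*W)
out_gemm = (w_mat @ x_mat).reshape(c_out, height, width)

print(np.allclose(out_conv, out_gemm))
```

So a 1x1 convolution is already a matrix multiply: the image only needs a reshape, not an unfolding step.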

GEMM

In the article you provided, there is this at the top:

[...] function called GEMM. It’s part of the BLAS (Basic Linear Algebra Subprograms)

So BLAS is a specification which describes a set of low-level linear algebra routines and their interfaces.

Now, there are many implementations of BLAS, tailored to specific architectures or with traits that are useful in particular contexts. For example, there is cuBLAS, which is written and optimized for NVIDIA GPUs (and used heavily by "higher level" deep learning libraries like PyTorch), or Intel's MKL for Intel CPUs (you can read more about BLAS anywhere on the web).

Usually those are written in low-level languages (Fortran, C, Assembly, C++) for maximum performance.

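GEMM is the GEneral Matrix Multiply routine (it computes C = alpha*A*B + beta*C), which is used to implement fully connected layers and convolutions and is provided by the various BLAS implementations. It has nothing to do with deep learning convolution per se; it is a fast matrix multiplication routine (tuned for things like cache hits). A "GEMM-based" convolution is one that first rearranges the image and the filters into 2D matrices, for example with im2col as you describe, so that this routine can do the heavy lifting.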
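As a rough sketch of the im2col-plus-GEMM idea from the question, the following NumPy code unfolds the image into a matrix and then performs a single matrix multiply. The helper names `im2col` and `conv2d_gemm` are my own, and the code assumes stride 1, no padding, no dilation and one image; real libraries use far more optimized variants, and `@` here simply stands in for a GEMM call.

```python
import numpy as np

def im2col(image, kh, kw):
    """Unfold a (C_in, H, W) image into a (C_in*kh*kw, out_h*out_w) matrix.

    Assumes stride 1, no padding and no dilation.
    """
    c_in, h, w = image.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c_in * kh * kw, out_h * out_w), dtype=image.dtype)
    row = 0
    for c in range(c_in):
        for i in range(kh):
            for j in range(kw):
                # Every value that kernel element (c, i, j) touches,
                # one entry per output pixel.
                cols[row] = image[c, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv2d_gemm(image, weight):
    """Deep-learning style convolution (cross-correlation) as one matmul."""
    c_out, c_in, kh, kw = weight.shape
    _, h, w = image.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    w_mat = weight.reshape(c_out, c_in * kh * kw)   # filters as rows
    cols = im2col(image, kh, kw)                    # patches as columns
    return (w_mat @ cols).reshape(c_out, out_h, out_w)

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 8, 8))       # (C_in, H, W)
weight = rng.standard_normal((4, 3, 3, 3))   # (C_out, C_in, kH, kW)
out = conv2d_gemm(image, weight)
print(out.shape)  # (4, 6, 6)
```

Note that with a 1x1 kernel the unfolding step degenerates to a plain reshape of the image, which is why 1x1 convolutions map onto a matrix multiply so directly.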

Direct convolutions

This approach computes the convolution straight from its definition: you multiply each input element with each kernel element and accumulate the products, which is O(n^2) for two signals of length n. There is a more efficient approach using the Fast Fourier Transform, which is O(n*log(n)). Some info is presented in this answer, and questions about this part would be better suited to the math-related Stack Exchange sites.
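A small 1D illustration of the two approaches, as a sketch only. The function names are hypothetical, and this is the mathematical (kernel-flipping) convolution, whereas deep learning layers usually compute cross-correlation; the complexity argument is the same either way.

```python
import numpy as np

def conv1d_direct(signal, kernel):
    """Full linear convolution straight from the definition: O(n*m) multiplies."""
    n, m = len(signal), len(kernel)
    out = np.zeros(n + m - 1)
    for i in range(n):
        for j in range(m):
            out[i + j] += signal[i] * kernel[j]
    return out

def conv1d_fft(signal, kernel):
    """Same result via the convolution theorem: multiply in the frequency domain."""
    size = len(signal) + len(kernel) - 1   # zero-pad to avoid circular wrap-around
    spectrum = np.fft.rfft(signal, size) * np.fft.rfft(kernel, size)
    return np.fft.irfft(spectrum, size)

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])
print(conv1d_direct(signal, kernel))
print(conv1d_fft(signal, kernel))
```

Both calls should agree up to floating-point rounding. For the small kernels typical in neural networks, the FFT route is often not the fastest choice in practice despite the better asymptotic complexity.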
