Best approach for convolution of multiple small matrices using CUDA

Question

I need to perform multiple convolutions with small matrices and kernels, and I was hoping that utilizing the many processors of the GPU would enable me to do it as fast as possible.

The problem is as follows: I have many matrices (~1,000 to ~10,000) of relatively small sizes (~15x15 down to 1x1, i.e. a scalar), and a certain number of convolution masks (~20 down to 1). I need to convolve all the matrices with each convolution mask. Example:

A; % 5,000 matrices of size 10x10, A(i) = a 10x10 matrix
B; % 10 kernels of size 5x5, B(j) = a 5x5 matrix
res(j) = conv(A, B(j)); % res(j) is the result of convolving all 5,000
                        % matrices in A with the j'th kernel B(j)

The goal is to compute res(1), ..., res(10) as quickly as possible.
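To make the example concrete, here is one way to write it down. It assumes that conv means the full 2D convolution (the default of MATLAB's conv2; the question does not state this), with 0-based indices and A_i treated as zero outside its 10x10 support:

```latex
\[
\mathrm{res}_j^{(i)}[m,n] \;=\; \sum_{p=0}^{4}\sum_{q=0}^{4} B_j[p,q]\, A_i[m-p,\,n-q],
\qquad 0 \le m,n \le 13 ,
\]
\[
\text{work} \;=\; \underbrace{5000}_{\text{matrices}} \times \underbrace{10}_{\text{masks}}
\times \underbrace{10^2}_{\text{elements of } A_i} \times \underbrace{5^2}_{\text{elements of } B_j}
\;=\; 1.25 \times 10^{8} \ \text{multiply-adds}.
\]
```

Under that assumption each result is a 14x14 matrix, and the whole job is on the order of 10^8 multiply-adds spread over 50,000 independent small convolutions.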

I would like to hear suggestions on how to implement this as efficiently as possible. FFT-based convolution would probably be too slow.

Every implementation I've seen so far is for 2D convolution of two large matrices, while I need to convolve many small matrices.

I know very little about CUDA programming right now, but I'm in the process of learning.

I was hoping to figure this out myself, but due to time constraints, I am forced to ask for any advice anyone with experience can give me, while I learn how to code in CUDA.

Thank you!

P.S. Any pointers to an implementation that suits my purposes are more than appreciated. I am a university student, and this is for a small research project, so nothing I need to pay for, please...

This problem is not ideal for the GPU due to the small size of the matrices. From experience implementing batched solvers for small matrices on the GPU, I would recommend using one thread block per matrix for the larger matrices, and one thread per matrix for the really small matrices. You would have to find the switchover point between the two approaches experimentally; it is likely between dimension 7 and dimension 10. — njuffa
Thank you. I thought almost no one needed this kind of thing, but I'm glad to see that at least someone has implemented something like it. Do you happen to know where I can find a very fast CUDA implementation for such a problem? I've looked and couldn't find anything, but if there's a very good implementation out there, it would be great. I don't expect my code to be as fast as that of more experienced CUDA programmers (and right now I have almost no knowledge of this topic, so anyone would be more experienced than me...) — user1999728
Batched solver and matrix inverse code is available for download from NVIDIA's registered developer website. There is no batched convolution code that I am aware of; I was simply outlining a possible partitioning based on the similarity in terms of size and number of matrices. Since the convolution work is in the context of a small student research project, this seems like a good chance to gain experience by implementing the functionality yourself. — njuffa
I would, but it's not the point of the project. Convolution just takes a huge amount of time in my code. I was supposed to use a highly optimized code written by someone else, but that fell through, and I don't have much time... — user1999728
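The "one thread block per matrix" partitioning suggested in the comments above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the sizes come from the question's example, the output is assumed to be the full convolution, the names `batchedConv2dFull` and `runBatchedConv` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sizes from the question's example (assumptions, adjust as needed).
constexpr int M   = 10;          // each input matrix is M x M
constexpr int K   = 5;           // each mask is K x K
constexpr int OUT = M + K - 1;   // full convolution output is OUT x OUT

// One thread block per (matrix, mask) pair, one thread per output element.
__global__ void batchedConv2dFull(const float* A, const float* B,
                                  float* res, int numMatrices)
{
    __shared__ float sA[M * M];
    __shared__ float sB[K * K];

    const int i = blockIdx.x;    // which matrix
    const int j = blockIdx.y;    // which mask
    const int tid = threadIdx.y * blockDim.x + threadIdx.x;
    const int nThreads = blockDim.x * blockDim.y;

    // Stage the small matrix and the mask in shared memory.
    for (int t = tid; t < M * M; t += nThreads) sA[t] = A[(size_t)i * M * M + t];
    for (int t = tid; t < K * K; t += nThreads) sB[t] = B[(size_t)j * K * K + t];
    __syncthreads();

    const int r = threadIdx.y, c = threadIdx.x;
    if (r >= OUT || c >= OUT) return;

    float acc = 0.0f;
    for (int p = 0; p < K; ++p)
        for (int q = 0; q < K; ++q) {
            const int ar = r - p, ac = c - q;   // input element hit by tap (p, q)
            if (ar >= 0 && ar < M && ac >= 0 && ac < M)
                acc += sA[ar * M + ac] * sB[p * K + q];
        }
    // Results are stored mask-major: res[j][i][r][c].
    res[((size_t)j * numMatrices + i) * OUT * OUT + r * OUT + c] = acc;
}

// Hypothetical host wrapper: copies data over, launches the kernel, copies back.
std::vector<float> runBatchedConv(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int numMatrices, int numMasks)
{
    std::vector<float> res((size_t)numMasks * numMatrices * OUT * OUT);
    float *dA, *dB, *dRes;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dRes, res.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(OUT, OUT);               // 14 x 14 = 196 threads
    dim3 grid(numMatrices, numMasks);   // one block per (matrix, mask) pair
    batchedConv2dFull<<<grid, block>>>(dA, dB, dRes, numMatrices);

    cudaMemcpy(res.data(), dRes, res.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dRes);
    return res;
}
```

For the very small matrices, the other option from the comments (one thread per matrix) would replace the per-element threads with a loop inside a single thread; where the switchover pays off has to be measured.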

Vitality · Accepted Answer · 2013-07-31T09:47:05

I do not pretend to give an ultimate answer to your question, but I would just like to point out a couple of things:

As you mentioned, a first possibility would be to use the FFT approach. A problem along this line is that (correct me if I'm wrong) the cuFFT library is primarily designed to cope with large matrices, so benefiting from this approach would require FFT routines that are efficient for small matrices. I just want to point out that algorithms of this kind exist; see for example the paper: Small Discrete Fourier Transforms on GPUs. I have no direct experience with the performance of CUDA FFTs on small matrices of the indicated type, but it could be interesting for you, since there are only a few mask matrices (10) and so you can "recycle" their FFTs for a large number of convolutions (5000).
If you decide not to use the FFT approach and you have a GPU with compute capability >= 3.5, then dynamic parallelism could be a good candidate for calculating the convolutions. If you regard the evaluation of each convolution matrix element as an interpolation, then you will have interpolation problems of size 15x15, and dynamic parallelism could help; see the post: Benefit of splitting a big CUDA kernel and using dynamic parallelism
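As a rough illustration of the "recycle the mask FFTs" idea from the first point, the sketch below uses cuFFT's batched plans (`cufftPlanMany`). It is an untuned sketch under stated assumptions: the padded size P = 16, the helper names `padToComplex`, `multiplySpectra` and `fftConvBatched` are all hypothetical, the output is the full convolution, and error checking is omitted. Whether it beats a direct kernel at these sizes is exactly the open question raised above.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

constexpr int P  = 16;      // padded FFT size (assumption); needs P >= M + K - 1
constexpr int PP = P * P;

// Copy `count` real S x S images into zero-padded P x P complex images.
static std::vector<cufftComplex> padToComplex(const std::vector<float>& src, int count, int S)
{
    std::vector<cufftComplex> dst((size_t)count * PP, cufftComplex{0.0f, 0.0f});
    for (int n = 0; n < count; ++n)
        for (int r = 0; r < S; ++r)
            for (int c = 0; c < S; ++c)
                dst[(size_t)n * PP + r * P + c].x = src[(size_t)n * S * S + r * S + c];
    return dst;
}

// out[j][i] = FA[i] * FB[j] / (P*P); the scale undoes cuFFT's unnormalized transforms.
__global__ void multiplySpectra(const cufftComplex* FA, const cufftComplex* FB,
                                cufftComplex* out, int numMatrices, int numMasks)
{
    const size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx >= (size_t)numMasks * numMatrices * PP) return;
    const int k = (int)(idx % PP);
    const size_t pair = idx / PP;
    const cufftComplex a = FA[(pair % numMatrices) * PP + k];
    const cufftComplex b = FB[(pair / numMatrices) * PP + k];
    const float s = 1.0f / PP;
    out[idx].x = (a.x * b.x - a.y * b.y) * s;
    out[idx].y = (a.x * b.y + a.y * b.x) * s;
}

// A: numMatrices real M x M images, B: numMasks real K x K masks.
// Returns res[j][i] as (M+K-1) x (M+K-1) full convolutions.
std::vector<float> fftConvBatched(const std::vector<float>& A, const std::vector<float>& B,
                                  int numMatrices, int numMasks, int M, int K)
{
    const int OUT = M + K - 1;
    const size_t pairs = (size_t)numMasks * numMatrices;
    std::vector<cufftComplex> hA = padToComplex(A, numMatrices, M);
    std::vector<cufftComplex> hB = padToComplex(B, numMasks, K);
    std::vector<cufftComplex> hOut(pairs * PP);

    cufftComplex *dA, *dB, *dOut;
    cudaMalloc(&dA, hA.size() * sizeof(cufftComplex));
    cudaMalloc(&dB, hB.size() * sizeof(cufftComplex));
    cudaMalloc(&dOut, hOut.size() * sizeof(cufftComplex));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    int n[2] = {P, P};
    cufftHandle planA, planB, planOut;
    cufftPlanMany(&planA, 2, n, nullptr, 1, PP, nullptr, 1, PP, CUFFT_C2C, numMatrices);
    cufftPlanMany(&planB, 2, n, nullptr, 1, PP, nullptr, 1, PP, CUFFT_C2C, numMasks);
    cufftPlanMany(&planOut, 2, n, nullptr, 1, PP, nullptr, 1, PP, CUFFT_C2C, (int)pairs);

    cufftExecC2C(planA, dA, dA, CUFFT_FORWARD);   // all matrices, once (in place)
    cufftExecC2C(planB, dB, dB, CUFFT_FORWARD);   // all masks, once, reused for every matrix
    const int threads = 256;
    const unsigned blocks = (unsigned)((pairs * PP + threads - 1) / threads);
    multiplySpectra<<<blocks, threads>>>(dA, dB, dOut, numMatrices, numMasks);
    cufftExecC2C(planOut, dOut, dOut, CUFFT_INVERSE);

    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    cufftDestroy(planA); cufftDestroy(planB); cufftDestroy(planOut);
    cudaFree(dA); cudaFree(dB); cudaFree(dOut);

    // Keep the real part of the top-left OUT x OUT corner of each padded result.
    std::vector<float> res(pairs * OUT * OUT);
    for (size_t pr = 0; pr < pairs; ++pr)
        for (int r = 0; r < OUT; ++r)
            for (int c = 0; c < OUT; ++c)
                res[pr * OUT * OUT + r * OUT + c] = hOut[pr * PP + r * P + c].x;
    return res;
}
```

This needs to be linked against cuFFT (for example `nvcc file.cu -lcufft`). Padding to at least M + K - 1 in each dimension is what turns the circular convolution computed by the FFT into the linear one; the mask spectra are computed a single time and reused for all matrices.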
