I need to perform many convolutions with small matrices and kernels, and I was hoping that using the many processors of a GPU would let me do it as fast as possible.
The problem is as follows: I have many matrices (~1,000 to ~10,000) of relatively small sizes (~15x15 down to 1x1, i.e. a scalar), and a certain number of convolution masks (~20 down to 1). I need to convolve all the matrices with each convolution mask. For example:
A; % 5,000 matrices of size 10x10, A(i) = a 10x10 matrix
B; % 10 matrices of size 5x5, B(k) = a 5x5 matrix
res(j) = conv(A, B(j)); % res(j) is the result of convolving all 5,000
% matrices in A with the j'th kernel B(j)
The goal is to compute res(1),...,res(10) as quickly as possible.
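To make the computation concrete, here is a minimal CPU reference sketch in C++ of what I mean. It is not a GPU implementation. The `Batch` struct, the `conv_all` name, the flat row-major layout, and the choice of "full" convolution (output size rows_A + rows_B - 1 by cols_A + cols_B - 1) are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout (an assumption): `count` matrices of size rows x cols,
// stored back to back in one flat array, each matrix in row-major order.
struct Batch {
    std::size_t count, rows, cols;
    std::vector<float> data;
};

// Full 2D convolution of every matrix in `a` with every kernel in `b`.
// The result for kernel j and matrix i is stored as matrix number
// j * a.count + i of the returned batch.
Batch conv_all(const Batch& a, const Batch& b) {
    const std::size_t orows = a.rows + b.rows - 1;
    const std::size_t ocols = a.cols + b.cols - 1;
    const std::size_t osize = orows * ocols;
    Batch out{a.count * b.count, orows, ocols,
              std::vector<float>(a.count * b.count * osize, 0.0f)};

    for (std::size_t j = 0; j < b.count; ++j) {
        const float* K = &b.data[j * b.rows * b.cols];
        for (std::size_t i = 0; i < a.count; ++i) {
            const float* A = &a.data[i * a.rows * a.cols];
            float* R = &out.data[(j * a.count + i) * osize];
            // Gather form: each output element is an independent sum.
            for (std::size_t m = 0; m < orows; ++m) {
                for (std::size_t n = 0; n < ocols; ++n) {
                    float sum = 0.0f;
                    for (std::size_t kr = 0; kr < b.rows; ++kr) {
                        // Skip kernel rows that fall outside the matrix.
                        if (kr > m || m - kr >= a.rows) continue;
                        for (std::size_t kc = 0; kc < b.cols; ++kc) {
                            if (kc > n || n - kc >= a.cols) continue;
                            sum += K[kr * b.cols + kc] *
                                   A[(m - kr) * a.cols + (n - kc)];
                        }
                    }
                    R[m * ocols + n] = sum;
                }
            }
        }
    }
    return out;
}
```

Every output element above depends only on one matrix and one kernel, so in principle each one could be handled by its own GPU thread. Whether that mapping is the fastest one is exactly what I am unsure about.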
I would like to hear suggestions about how to implement the most efficient algorithm. FFT-based convolution would probably be too slow for matrices and kernels this small.
Every implementation I've seen so far is for 2D convolution meant to convolve two large matrices, while I need to convolve many small matrices.
I know very little about CUDA programming right now, but I'm in the process of learning.
I was hoping to figure this out myself, but due to time constraints, I am forced to ask for any advice anyone with experience can give me, while I learn how to code in CUDA.
Thank you!
P.S. Any pointers to an implementation that suits my purposes are more than appreciated. I am a university student, and this is for a small research project, so nothing I need to pay for, please.