2
votes

I am totally new to CUDA and I would like to write a CUDA kernel that calculates a convolution, given an input matrix, a convolution (or filter) matrix, and an output matrix.

Note: I want each thread of the CUDA kernel to calculate one value in the output matrix.

How can I do this?

3
As far as I remember, there were dozens of examples on the CUDA website, especially given that convolution is a very common task. Has this changed, or haven't you found anything there? - CWBudde
@CWBudde thank you for your comment. Yes, I found a couple of long examples covering many hard cases on various websites, but unfortunately I haven't found a straightforward one yet. I would be more than happy if you have any. - Bilgin

3 Answers

1
votes

If the filters cover the full range of the matrix, then the operation can be converted directly to a cublasSgemm call.

For example, suppose the dimensions of the matrix are 5 * 4 and you need 130 filters; then the filter matrix to be trained has dimensions 130 * 20, and the 5 * 4 input can be treated as a 20 * 1 vector.

In this way, the computation speed is near optimal: the convolution becomes a matrix multiplication between m1 (130, 20) and m2 (20, 1).
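A minimal host-side sketch of this idea, assuming the filter matrix is stored row-major on the device and the input is already flattened to a 20 * 1 column vector (the names and the omitted error checking are my own, not from the answer):

```cuda
// y (130 x 1) = F (130 x 20) * x (20 x 1) via one GEMM call.
// cuBLAS is column-major, so a row-major 130 x 20 F looks to it like a
// 20 x 130 matrix; CUBLAS_OP_T restores the intended orientation.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void apply_filters(const float *dF,  // device ptr, 130 x 20, row-major
                   const float *dX,  // device ptr, 20 x 1
                   float *dY)        // device ptr, 130 x 1 (output)
{
    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    // m = 130 outputs, n = 1 column, k = 20 elements per filter
    cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N,
                130, 1, 20,
                &alpha,
                dF, 20,    // lda = row length of row-major F
                dX, 20,
                &beta,
                dY, 130);
    cublasDestroy(h);
}
```

Each row of F holds one flattened 5 * 4 filter, so the single GEMM evaluates all 130 filters at once.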

1
votes

I would like to write a cuda kernel that calculates a convolution given an input matrix, convolution (or filter) and an output matrix.

You might be interested in this treatment of the subject (although it's a little old). Or look at the CUDA convolution kernel sample programs: non-separable and separable

I want each thread of the cuda kernel to calculate one value in the output matrix.

If you follow the link, you'll realize you don't quite want that. In other words: don't make rigid assumptions about how your kernel should divide work among the threads; you might change your mind later.

0
votes

If you are looking for an image convolution kernel, this link may be helpful: Two Dimensional (2D) Image Convolution in CUDA by Shared & Constant Memory: An Optimized Way.

In my opinion, using each thread to calculate one pixel or position in the output may not be a very good idea. Consider how the sub-region for the convolution is loaded, and whether the threads in the same warp read contiguous memory on each access. Otherwise, the kernel may be limited by data loading even when hundreds of threads are available.

That said, you can start by writing the code you described and then use the profiler (nvvp) for further optimization suggestions.
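For reference, a minimal "one thread per output element" kernel along the lines the question describes might look like this (the names, the "valid" output size, and the cross-correlation convention are assumptions on my part):

```cuda
// Naive 2D convolution: each thread computes one output element.
// Input is H x W, the filter is K x K, and the output is the
// "valid" region (H-K+1) x (W-K+1), all stored row-major.
__global__ void conv2d_naive(const float *in, int H, int W,
                             const float *filt, int K,
                             float *out)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int oH = H - K + 1, oW = W - K + 1;
    if (row >= oH || col >= oW) return;

    float acc = 0.0f;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j)
            acc += in[(row + i) * W + (col + j)] * filt[i * K + j];
    out[row * oW + col] = acc;
}

// Launch sketch, one thread per output element:
// dim3 block(16, 16);
// dim3 grid((oW + block.x - 1) / block.x, (oH + block.y - 1) / block.y);
// conv2d_naive<<<grid, block>>>(d_in, H, W, d_filt, K, d_out);
```

This version makes no use of shared or constant memory, which is exactly why the profiler will likely flag it as memory-bound; the linked shared/constant-memory article shows the next step.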