I am completely new to CUDA and would like to write a CUDA kernel that computes a 2D convolution, given an input matrix, a convolution kernel (or filter), and an output matrix.
Note: I want each thread of the CUDA kernel to compute exactly one value of the output matrix.
How can I do this?
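
For concreteness, here is a minimal sketch of the one-thread-per-output mapping I have in mind. It assumes row-major `float` arrays, a "valid" convolution with no padding (so the output is `(inH - fH + 1)` by `(inW - fW + 1)`), and no filter flip, which strictly makes it a cross-correlation. The names `conv2dValid` and `runConv2dValid` are made up for illustration, and I am not sure this is the right or most efficient approach.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread computes one element of the output matrix.
// Assumes row-major storage, no padding ("valid" region), and no filter flip.
__global__ void conv2dValid(const float* in, const float* filt, float* out,
                            int inH, int inW, int fH, int fW)
{
    int outH = inH - fH + 1;
    int outW = inW - fW + 1;

    // Map this thread to one (row, col) position in the output.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= outH || col >= outW) return;  // threads past the edge do nothing

    // Slide the filter over the input window anchored at (row, col).
    float acc = 0.0f;
    for (int i = 0; i < fH; ++i)
        for (int j = 0; j < fW; ++j)
            acc += in[(row + i) * inW + (col + j)] * filt[i * fW + j];

    out[row * outW + col] = acc;
}

// Hypothetical host helper: copies data to the GPU, launches the kernel,
// and copies the result back. Returns true if no CUDA error was reported.
bool runConv2dValid(const float* hIn, const float* hFilt, float* hOut,
                    int inH, int inW, int fH, int fW)
{
    int outH = inH - fH + 1, outW = inW - fW + 1;
    size_t inBytes   = sizeof(float) * inH * inW;
    size_t filtBytes = sizeof(float) * fH * fW;
    size_t outBytes  = sizeof(float) * outH * outW;

    float *dIn = nullptr, *dFilt = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, inBytes);
    cudaMalloc(&dFilt, filtBytes);
    cudaMalloc(&dOut, outBytes);
    cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dFilt, hFilt, filtBytes, cudaMemcpyHostToDevice);

    // Enough 16x16 blocks to cover every output element (rounding up).
    dim3 block(16, 16);
    dim3 grid((outW + block.x - 1) / block.x, (outH + block.y - 1) / block.y);
    conv2dValid<<<grid, block>>>(dIn, dFilt, dOut, inH, inW, fH, fW);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dFilt);
    cudaFree(dOut);
    return err == cudaSuccess;
}

int main()
{
    const int inH = 4, inW = 4, fH = 3, fW = 3;
    const int outH = inH - fH + 1, outW = inW - fW + 1;

    float hIn[inH * inW], hFilt[fH * fW], hOut[outH * outW];
    for (int i = 0; i < inH * inW; ++i) hIn[i] = (float)i;  // 0..15
    for (int i = 0; i < fH * fW; ++i)   hFilt[i] = 1.0f;    // 3x3 box filter

    if (!runConv2dValid(hIn, hFilt, hOut, inH, inW, fH, fW)) {
        printf("CUDA error\n");
        return 1;
    }

    // With this input the four window sums are 45, 54, 81, 90.
    for (int r = 0; r < outH; ++r) {
        for (int c = 0; c < outW; ++c) printf("%6.1f ", hOut[r * outW + c]);
        printf("\n");
    }
    return 0;
}
```

The bounds check in the kernel is there because the grid is rounded up to whole blocks, so some threads fall outside the output matrix.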