
I have four CUDA kernels working on matrices in the following way:

convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);

// A + B + C with CUBLAS' cublasSaxpy

Every kernel (except the first one, the convolution) basically performs an element-wise multiplication of its matrix by a fixed value hardcoded in its constant memory (to speed things up).

Should I join these kernels into a single one by calling something like

multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)

?

The data should already be in the device's global memory, so this probably wouldn't help, but I'm not totally sure.
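For reference, a minimal sketch of what the fused elementwise kernel might look like. The `__constant__` scalars, the flat `float` layout, and the element count parameter `n` are all assumptions, not from the question:

```cuda
// Hypothetical fused kernel: scales B, A, and C each by its own constant
// in a single launch. X, Y, Z are set once from the host with
// cudaMemcpyToSymbol before the kernel runs.
__constant__ float X, Y, Z;

__global__ void multiplyBbyX_AbyY_CbyZ(float *B, float *A, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // n = total number of elements per matrix
        B[i] *= X;
        A[i] *= Y;
        C[i] *= Z;
    }
}
```

It would be launched with something like `multiplyBbyX_AbyY_CbyZ<<<(n + 255) / 256, 256>>>(dB, dA, dC, n);`.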

Can you test both versions and see which is better? Anyway, since you are reusing data already in memory across kernel calls, I doubt there will be any difference in performance. - Tudor

2 Answers


If I understood correctly, you're asking whether you should merge the three "multiplybyElement" kernels into one, where each of those kernels reads an entire (different) matrix, multiplies each element by a constant, and stores the scaled matrix back.

Given that these kernels will be memory-bandwidth bound (practically no computation, just one multiply per element), there is unlikely to be any benefit from merging them, unless your matrices are small. In that case the separate kernels make inefficient use of the GPU, since they execute in series (they are launched in the same stream).
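If the matrices really are small enough that serialization is the bottleneck, an alternative to merging is launching the three independent kernels in separate streams so they can overlap. A sketch, where the kernel names and launch configuration are assumptions:

```cuda
// Sketch: the three scaling kernels are independent (each touches a
// different matrix), so they can go into separate streams and may run
// concurrently if each grid alone underutilizes the GPU.
cudaStream_t s[3];
for (int i = 0; i < 3; ++i)
    cudaStreamCreate(&s[i]);

multiplybyElement1<<<grid, block, 0, s[0]>>>(dB);
multiplybyElement2<<<grid, block, 0, s[1]>>>(dA);
multiplybyElement3<<<grid, block, 0, s[2]>>>(dC);

cudaDeviceSynchronize();               // wait for all three streams
for (int i = 0; i < 3; ++i)
    cudaStreamDestroy(s[i]);
```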

0
votes

If merging the kernels means you can do only one pass over the memory, then you may see up to a 3x speedup.

Can you fold the fixed values in up front and then do the whole thing in a single kernel?
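One way to read this suggestion: since the end result is a weighted sum anyway, fold the three constant scales and the final addition into one kernel, so each input element is read exactly once and the result written once, replacing the three scaling kernels plus the cublasSaxpy calls. A sketch, where the signature, the output matrix R, and the `__constant__` scalars are assumptions:

```cuda
// Single-pass sketch: R = Y*A + X*B + Z*C (using the constant-to-matrix
// pairing implied by multiplyBbyX_AbyY_CbyZ(B,A,C) in the question).
__constant__ float X, Y, Z;

__global__ void scaleAndSum(const float *A, const float *B, const float *C,
                            float *R, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        R[i] = Y * A[i] + X * B[i] + Z * C[i];  // one read per input, one write
}
```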