Calling multiple kernels, global memory performances - CUDA

Question

I have four CUDA kernels working on matrices in the following way:

convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);

// A + B + C with CUBLAS' cublasSaxpy

Every kernel except the first (the convolution) multiplies each element of a matrix by a fixed value hardcoded in constant memory (to speed things up).

Should I join these kernels into a single one by calling something like

multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)

?

The matrices should already be in global memory on the device, so merging probably would not help, but I'm not totally sure.

Can you try to test both versions and see which is better? Anyway, since you are reusing data already in memory across kernel calls, I doubt there will be any difference in performance. — Tudor
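One way to run the comparison suggested in this comment is to time both versions with CUDA events. The following is only a sketch under stated assumptions: `scaleKernel`, `scaleThreeKernel` and `timeScaling` are hypothetical names, not the asker's kernels, the three matrices are assumed to have the same number of elements `n`, and the scale factors are passed as kernel arguments rather than read from `__constant__` memory as in the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one multiplybyElementN kernel: scales a matrix in place.
__global__ void scaleKernel(float* m, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) m[i] *= factor;
}

// Hypothetical merged kernel: scales three equally sized matrices in one launch.
__global__ void scaleThreeKernel(float* a, float* b, float* c,
                                 float fa, float fb, float fc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] *= fa;
        b[i] *= fb;
        c[i] *= fc;
    }
}

// Returns the elapsed time in milliseconds for the three separate launches
// (fused == false) or for the single merged launch (fused == true).
float timeScaling(float* dA, float* dB, float* dC, int n, bool fused) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    if (fused) {
        scaleThreeKernel<<<blocks, threads>>>(dA, dB, dC, 2.0f, 3.0f, 4.0f, n);
    } else {
        scaleKernel<<<blocks, threads>>>(dB, 3.0f, n);
        scaleKernel<<<blocks, threads>>>(dA, 2.0f, n);
        scaleKernel<<<blocks, threads>>>(dC, 4.0f, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling `timeScaling(dA, dB, dC, n, false)` and `timeScaling(dA, dB, dC, n, true)` several times each, and discarding the first measurement of each, should give a fairer comparison, since the first launch typically includes one-time startup overhead.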

Tom Tom · Accepted Answer · 2012-04-14T20:15:47

If I understood correctly, you're asking whether you should merge the three "multiplybyElement" kernels into one, where each of those kernels reads an entire (different) matrix, multiplies each element by a constant, and stores the scaled matrix.

These kernels will be memory-bandwidth bound (practically no computation, just one multiply per element), so merging them is unlikely to bring any benefit unless your matrices are small. With small matrices, the separate kernels make inefficient use of the GPU because they execute in series (same stream).
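Since the serialization comes from launching everything into the same stream, one alternative to merging for small matrices is to launch each scaling kernel in its own stream, which allows (but does not guarantee) overlap on devices that support concurrent kernel execution. The sketch below is an illustration only: `scaleKernel` and `scaleInStreams` are hypothetical names, the matrices are assumed to be independent and of equal size `n`, and the factors are passed as arguments instead of living in constant memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one multiplybyElementN kernel: scales a matrix in place.
__global__ void scaleKernel(float* m, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) m[i] *= factor;
}

// Launches one scaling kernel per matrix, each in its own stream,
// then waits for all three to finish before returning.
void scaleInStreams(float* d[3], const float factor[3], int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaStream_t streams[3];

    for (int k = 0; k < 3; ++k) {
        cudaStreamCreate(&streams[k]);
        // Fourth launch parameter selects the stream; third is dynamic shared memory.
        scaleKernel<<<blocks, threads, 0, streams[k]>>>(d[k], factor[k], n);
    }
    for (int k = 0; k < 3; ++k) {
        cudaStreamSynchronize(streams[k]);
        cudaStreamDestroy(streams[k]);
    }
}
```

In the asker's pipeline the scaling of B (and of A, if the convolution reads it) would still have to wait for the convolution to finish, so this only applies to work that is genuinely independent. For large matrices each kernel already fills the GPU, so neither separate streams nor merging would be expected to change much.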

