I have four CUDA kernels working on matrices in the following way:
convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);
// A + B + C with CUBLAS' cublasSaxpy
every kernel basically (except the convolution first) performs a matrix each-element multiplication by a fixed value hardcoded in its constant memory (to speed things up).
Should I join these kernels into a single one by calling something like
multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)
?
Global memory should already be on the device so probably that would not help, but I'm not totally sure