I have kernels that take one common (shared) argument, but each kernel does independent work: they do not touch the same data. How can I make them overlap (run concurrently)?
The CUDA code looks like this (k0 and k1 are launched in dedicated streams str[i]):
k0<<<..., str[0]>>>(arg1, arg2, arg3);
k1<<<..., str[1]>>>(arg4, arg2, arg5); // k1 also uses arg2
Also, I have intentionally declared the kernels like this:
k0(double const * const arg1, double const * const arg2, double * arg3);
k1(double const * const arg4, double const * const arg2, double * arg5);
that is, with double const * const for arg2, in the hope that this hint would help nvcc.
nvvp shows that k1 runs in str[1] only after k0 has finished running in str[0].
The kernels do not overlap: is this related to the common argument (arg2)? If so, how can I make the kernels overlap in this case?
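For reference, a shared read-only argument does not, by itself, serialize kernels: overlap depends on the device supporting concurrent kernels and on the first kernel leaving SM resources free. A minimal self-contained version of the launch pattern above (hypothetical kernel bodies, sizes, and names) would look like:

```cuda
// Sketch only: two kernels read the same const input but write to
// different outputs, launched into separate non-default streams.
#include <cuda_runtime.h>

__global__ void k0(const double* __restrict__ in, double* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0;   // reads shared input, writes its own output
}

__global__ void k1(const double* __restrict__ in, double* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0;   // independent of k0's output
}

int main() {
    const int n = 1 << 20;             // placeholder problem size
    double *in, *out0, *out1;
    cudaMalloc(&in,   n * sizeof(double));
    cudaMalloc(&out0, n * sizeof(double));
    cudaMalloc(&out1, n * sizeof(double));

    cudaStream_t str[2];
    cudaStreamCreate(&str[0]);
    cudaStreamCreate(&str[1]);

    dim3 block(256), grid((n + 255) / 256);
    k0<<<grid, block, 0, str[0]>>>(in, out0, n);  // these two launches may
    k1<<<grid, block, 0, str[1]>>>(in, out1, n);  // overlap if resources allow
    cudaDeviceSynchronize();

    cudaStreamDestroy(str[0]); cudaStreamDestroy(str[1]);
    cudaFree(in); cudaFree(out0); cudaFree(out1);
    return 0;
}
```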
Update
I use Kepler K20m :
- compute capability is 3.5.
- kernel concurrency is supported (the previous version I was working on [let's call it v0] implements a different strategy where kernels overlap [checked with nvvp] but are much smaller: with v0, the kernels seem to be too small to be efficient [kernel launch time >= kernel run time]).
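As a sanity check, the runtime can confirm what the hardware reports about concurrent kernel support (this is a standalone query program, not part of the algorithm):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    printf("%s: compute capability %d.%d, concurrentKernels = %d\n",
           prop.name, prop.major, prop.minor, prop.concurrentKernels);
    return 0;
}
```

On a K20m this should report compute capability 3.5 with concurrentKernels = 1.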
Unfortunately, the algorithm is always accessing data, so I would expect it to be highly bandwidth-bound (each thread reads a lot, often from several arrays, does a few additions or multiplications, then writes back to arrays - there is not much way to avoid this, or even to code it differently...). Note: there are no ifs in the algorithm, so divergent execution is not a problem.
From v0, I implemented another version, v1, to "make kernels bigger": now I have kernel launch time (50 microseconds according to nvvp) << kernel run time (4 milliseconds according to nvvp), which seems like a good thing to me. v1 also improved both efficiency (store: 90%, load: 155%) and occupancy (achieved: 52%, theoretical: 62%) a lot compared to v0. At this point, the nvvp kernel performance limiter shows that the kernel is "compute-bound" ("function units": 85%, "memory": 5%), which again seems like a good thing to me (even if I was surprised, as I expected the kernel to be bandwidth-bound).
v1 is still slower than the CPU: according to nvvp, about 50% of the "stall reasons" are "execution dependency" (pie chart - kernel latency). So from v1, I tried to implement v2, which is "v1 split into several independent pieces" (hoping to increase instruction-level parallelism by doing independent things in parallel): execution dependency jumps to 70% and the (independent) kernels do not overlap, which is the root cause that led me to Stack Overflow...
According to Tom's answer, I am not sure how to check whether "the GPU is full". Kernels from v2 have 30% occupancy: to me, this means there is already room for another kernel, no?! I've just tried smaller block/grid sizes, but it doesn't seem to help (it's even worse: occupancy drops to 10%).
From the overall hints provided by nvvp (at the application level), I get a message like "low kernel concurrency: the percentage of time when two kernels are executed in parallel is low".
To conclude: I don't see any light at the end of the tunnel... Any good idea would be appreciated! I feel like I am missing something that stalls the GPU, but I can't figure out what it is.
Update
I already tried to size the grid and block with the CUDA Occupancy Calculator spreadsheet. It turns out that, for this algorithm, big blocks and big grids seem to perform better than small ones. The kernels do not use shared memory. My guess is that there are not enough registers left for independent kernels to run concurrently: if so, what I observe would be logical. I will try to play with that.
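If your toolkit is CUDA 6.5 or newer, the spreadsheet values can also be cross-checked at runtime with the occupancy API. A sketch, where `k0` stands in for one of the kernels above and the block size of 256 is an assumption:

```cuda
// Sketch: query how many blocks of a given kernel fit per SM at runtime,
// instead of reading the Occupancy Calculator spreadsheet (CUDA 6.5+).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void k0(const double* in, double* out, int n) { /* ... */ }

int main() {
    int numBlocks = 0;
    const int blockSize = 256;  // placeholder block size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, k0, blockSize, 0 /* dynamic shared memory */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max active blocks/SM: %d (occupancy %.0f%%)\n", numBlocks,
           100.0 * numBlocks * blockSize / prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

If the reported occupancy for one kernel is already near the limit of what its register usage allows, a second kernel has little room to run concurrently.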
Update
For the record, reducing register use does not help.
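For completeness, the two standard knobs I used to cap register usage were (numbers are placeholders; check the per-kernel usage that `nvcc --ptxas-options=-v` reports before picking values):

```cuda
// 1) Per-kernel: ask the compiler to fit 256 threads/block with at least
//    4 blocks resident per SM, which indirectly limits registers/thread.
__global__ void __launch_bounds__(256, 4)
k0(const double* in, double* out, int n) { /* ... */ }

// 2) Whole compilation unit, on the nvcc command line:
//    nvcc -maxrregcount=32 ...
```

Forcing registers down too far makes the compiler spill to local memory, which can cost more than the occupancy it buys, so this not helping is plausible.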
How long does k0 run for? What compute capabilities are you targeting? - Avi Ginsburg