CUDA : kernel concurrency with common (shared) argument

Question

I have kernels that use one common (shared) arguments, but, each kernel do independant things : they do not overlap. How to make them overlap ?

The CUDA code looks like (k0 and k1 are launched in dedicated streams str[i]):

k0<<<..., str[0]>>>(arg1, arg2, arg3);
k1<<<..., str[1]>>>(arg4, arg2, arg5); // k2 uses also arg2

Also, I have intentionally declared kernels like :

k0(double const * const arg1, double const * const arg2, double * arg3);
k1(double const * const arg1, double const * const arg2, double * arg3);

that is with double const * const for arg2 in hope this hint would help nvcc.

nvvp shows that k2 runs in str[1] after k1 has finished to run in str[0].

kernels do not overlap : is this related to the common argument (arg2) ? If yes, how to make kernels overlapping each other in this case ?

Update

I use Kepler K20m :

compute capability is 3.5.
kernel concurrency is supported (the previous version I was working on [let's call it v0] implements a different strategy where kernels overlap [checked with nvvp] but are much smaller: with v0, kernels seems to be too small to be efficient [kernel launch time >= kernel run time]).

The algorithm is unfortunatelly always accessing data so I would expect it to be highly bandwidth-bounded (each thread reads a lot and often from several arrays, do a few addition or multiplication, then writes back to arrays - not much way to avoid this, or even to code it differently...). Note : there is "no if" in the algorithm so divergence execution is not a problem.

From v0, I implemented another version v1 to "make kernels bigger" : now, I have kernel launch time (50 microsec according to nvvp) << kernel run time (4 millisec according to nvvp) which seems to be a good thing to me. Also, V1 improved a lot both efficiency (store : 90%, load : 155%) and occupancy (achieved : 52%, theoretical : 62%) compared to v0. At this point, nvvp kernel performance limiter shows that the kernel is "compute-bounded" ("function units" : 85%, "memory" : 5%) which again seems to be a good thing to me (even if I was surprised as I expected the kernel to be bandwidth-bounded).

v1 still down-speeds when compared to CPU : according to nvvp, it seems 50% of "stall reasons" is "execution dependency" (pie chart - kernel latency). So from v1, I tried to implement v2 which is "v1 split in several independant pieces" (hope to increase instruction-level-parallelism by doing in parallel independant things) : execution dependency jumps to 70% and (independant) kernels do not overlap, this is the root cause that led me to StackOverflow...

According to Tom answer, I am not sure to know "how to check if the GPU is full". Kernels from v2 have 30% occupancy : to me, this means that there is already room for other kernel, no ?! I've just tried with smaller block / grid size but it doesn't seems to help (it's even worse : occupancy drops to 10%).

From, overall hints provided by nvvp (at application level), I get a message like "low concurrency kernels : percentage of time where 2 kernels are executed in parallel is low".

To conclude : I don't see any light at the end of the tunnel... Any good idea would be appreciated ! I feel like I miss something that stucks the GPU, but, I can't figure out what this is.

Update

I already tried to size grid and block with the CUDA Occupancy Calculator spreadsheet. It turns out that, for this algorithm, big blocks and big grids seems to perform better than small ones. The kernels do not use shared memory. I guess there are not enough registers for independant kernels to run concurrently : if so, what I observe would be logical. I will try to play on that.

Update

For the record, reducing register use does not help.

How long does k0 run for? What compute capabilities are you targeting? — Avi Ginsburg

Tom Tom · Accepted Answer · 2015-06-08T11:10:22

kernels do not overlap : is this related to the common argument (arg2) ?

No, it's not because they share an input argument.

By using CUDA streams you, the programmer, are explicitly saying that k0 and k1 can be executed concurrently, so even if you replace arg5 with arg3 the hardware can still execute them concurrently.

Since this is not happening:

Remember that the kernels will only run simultaneously if the first kernel does not fill the GPU, so check the grid size - if the first kernel fills the entire GPU then there will be no resources to run the second kernel until it finishes (you may have a tiny overlap as the first one drains and the second one starts)
What GPU are you running this on? Concurrent kernels requires compute capability 2.0 or greater, you can check the concurrentKernels device property to check (see the deviceQuery sample for example)

Based on the additional information you provided (please don't post additional info as an answer, you should modify your question or comment instead) we can conclude that (1) is the case here. The work scheduler will schedule all the blocks for one kernel before starting the next; since the first kernel takes several milliseconds to execute and consists of many threadblocks, the second kernel will only begin to execute as the last blocks start to finish.

Occupancy is a measure of the number of warps being managed by an SM, with an occupancy of 30% that means that some other factor is limiting how many warps can be scheduled (e.g. register usage, shared memory usage) - it does not mean we can run the second kernel. You can use the CUDA Occupancy Calculator spreadsheet to investigate occupancy further.

For further information on concurrent kernels you could also look at the simpleHyperQ documentation.

CUDA : kernel concurrency with common (shared) argument

Update

Update

Update

2 Answers