Assume you have a dense 1500x500 matrix and you need to multiply it by a block-diagonal 500x500 matrix that consists of ten 50x50 sub-matrices sitting on the diagonal:
S 0 ... 0 0
0 S ... 0 0
...
0 0 ... S 0
0 0 ... 0 S   <- each S is 50x50
Sometimes all S are equal, sometimes they're not.
I haven't profiled yet, but I suspect a straight dense cuBLAS multiplication would waste most of its time on the zeros, since only one tenth of the entries in the 500x500 matrix can be nonzero. Are there any efficient ways to implement such a multiplication?
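To make the structure concrete, here is a minimal CPU reference sketch in NumPy of what I mean, not a GPU implementation. It relies on the fact that column block j of the result is just `A[:, 50*j:50*(j+1)] @ S_j`, so the zeros never need to be touched. The function name `blockdiag_right_multiply` and the stacked `(10, 50, 50)` layout for the blocks are my own assumptions for illustration.

```python
import numpy as np

def blockdiag_right_multiply(A, blocks):
    """Compute A @ blockdiag(blocks) without forming the zero entries.

    A      : (m, k*b) dense matrix.
    blocks : (k, b, b) stack of diagonal blocks, or a single (b, b)
             array when all diagonal blocks are equal.
    """
    A = np.asarray(A)
    blocks = np.asarray(blocks)
    b = blocks.shape[-1]
    m, n = A.shape
    k = n // b
    if k * b != n:
        raise ValueError("number of columns of A must be a multiple of the block size")

    # View A as k column slabs of width b: A3[i, j, :] == A[i, j*b:(j+1)*b].
    A3 = A.reshape(m, k, b)

    if blocks.ndim == 2:
        # All blocks equal: a single (m*k, b) @ (b, b) product covers every slab.
        return (A3.reshape(m * k, b) @ blocks).reshape(m, n)

    if blocks.shape[0] != k:
        raise ValueError("expected one block per column slab")

    # Distinct blocks: k independent (m, b) @ (b, b) products.
    return np.einsum("mkb,kbc->mkc", A3, blocks).reshape(m, n)
```

For the sizes above this needs 1500 x 500 x 50 multiply-adds instead of 1500 x 500 x 500, i.e. ten times fewer than the dense product. On the GPU, the distinct-blocks case looks like a natural fit for a batched GEMM such as `cublasSgemmStridedBatched`, but whether that actually beats one dense call at these sizes is something I would still have to profile.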