0
votes

I have seen poor performance from the cuBLAS API with small matrices, especially in batched general matrix multiplication. My goal is to write a fast kernel that calculates batched double-complex matrix multiplications for a matrix size of 2x2. The kernel should also be able to extract these 2x2 matrices from larger square matrices, i.e. perform block matrix multiplications.

The kernel that I have written calculates 8 2x2 double-complex matrix multiplications per block (32 threads). I achieve about 1.65 double-precision GFlops (NSight analysis) on a GTX 970 (compute capability 5.2), which should be capable of 109 double-precision GFlops. It takes about 300 ms to calculate 9.1 million matrix multiplications.

*a, *b and *c are pointers to arrays of N n x n matrices. These matrices must be square.

lda, ldb and ldc are the leading dimensions of the input matrices. Using these parameters, 2x2 blocks can be extracted from any array of square matrices.

Question: How do I increase the performance of this kernel?

#define THREADMULTIPLIER 8

__global__ void bulkMatrixMul(const cuDoubleComplex *a, int lda, const cuDoubleComplex *b, int ldb, cuDoubleComplex *c, int ldc){
    int pos = (THREADMULTIPLIER * blockIdx.x + threadIdx.z);  // matrix index in the batch
    int ia = pos * lda * lda + threadIdx.x;                   // element (row, 0) of matrix a
    int ib = (pos * ldb + threadIdx.y) * ldb;                 // element (0, col) of matrix b
    // c(row, col) = a(row, 0) * b(0, col) + a(row, 1) * b(1, col)
    c[(pos * ldc + threadIdx.y) * ldc + threadIdx.x] = cuCfma(a[ia], b[ib], cuCmul(a[ia + lda], b[ib + 1]));
}


void bulkBatchedMatrixMul(complex<double> *a, int lda, complex<double> *b, int ldb, complex<double> *c, int ldc, int batchSize, cudaStream_t *stream){
    dim3 threads(2, 2, THREADMULTIPLIER);
    bulkMatrixMul<<<batchSize / THREADMULTIPLIER, threads, 0, *stream>>>((cuDoubleComplex*)a, lda, (cuDoubleComplex*)b, ldb, (cuDoubleComplex*)c, ldc);
    if (cudaSuccess != cudaGetLastError())
        printf("Error!\n");
}

I have made my kernel 10 times faster by analysing Christian Sarofeen's code:

  1. It is faster to write a sum as one whole expression instead of accumulating with += and -=.
  2. Declaring a local output variable, and in particular splitting the real and imaginary parts of the calculation, increases speed.

The kernel below is an optimised kernel that calculates 16 matrix multiplications per block. It calculates 9.1 million matrix multiplications in 31 ms: I use a batchSize of 10000 and call the kernel 70 times per stream across 13 streams. According to NSight it achieves about 15 double-precision GFlops. A GTX 970 was used.

__global__ void bulkMatrixMul(const cuDoubleComplex *a, int lda, const cuDoubleComplex *b, int ldb, cuDoubleComplex *c, int ldc){
    int pos = (blockDim.z * blockIdx.x + threadIdx.z);
    int ia = pos * lda * lda + threadIdx.x;
    int ib = (pos * ldb + threadIdx.y) * ldb;
    cuDoubleComplex cR;
    cR.x = a[ia].x * b[ib].x - a[ia].y * b[ib].y - a[ia + lda].y * b[ib + 1].y + a[ia + lda].x * b[ib + 1].x;
    cR.y = a[ia].x * b[ib].y + a[ia].y * b[ib].x + a[ia + lda].x * b[ib + 1].y + a[ia + lda].y * b[ib + 1].x;
    c[(pos * ldc + threadIdx.y) * ldc + threadIdx.x] = cR;
}


void bulkBatchedMatrixMul(complex<double> *a, int lda, complex<double> *b, int ldb, complex<double> *c, int ldc, int batchSize, cudaStream_t *stream){
    dim3 threads(2, 2, 16);
    bulkMatrixMul <<< batchSize / 16, threads, 0, *stream >>>((cuDoubleComplex*)a, lda, (cuDoubleComplex*)b, ldb, (cuDoubleComplex*)c , ldc);
    if (cudaSuccess != cudaGetLastError())
        printf("Error!\n");
}
Comments:

  - Just to clarify, you want to multiply many 2x2 double-complex matrices? – Christian Sarofeen
  - Yes, I am multiplying 10 000 matrices in bulk, 70 times in 13 streams, which results in 9.1 million matrix multiplications. – Phoz
  - How are they stored in a and b? – Christian Sarofeen
  - A memory chunk of size N·n·n: A0 A1 A2 A3 … AN; each matrix is column-major. – Phoz
  - Your question and goal are not clear. – Christian Sarofeen

2 Answers

1
vote

I would assign a thread to each matrix multiplication and keep everything in registers. Don't worry about occupancy, as I suspect this will be memory bound. Once the data is in registers, each thread does a matrix-matrix multiplication and stores the result in c. Below is a demonstration.

#include <stdio.h>
#include <iostream>

#include <cuComplex.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N_MATRIX 10000

using namespace std;

// c += a * b for double-complex values:
// (a.x + i*a.y)(b.x + i*b.y) = (a.x*b.x - a.y*b.y) + i*(a.x*b.y + a.y*b.x)
__device__ void D_C_Mult_Add(cuDoubleComplex &a, cuDoubleComplex &b, cuDoubleComplex &c){
        c.x+=a.x*b.x;
        c.y+=a.x*b.y;
        c.x-=a.y*b.y;
        c.y+=a.y*b.x;
}

__global__ void multMatrix(cuDoubleComplex* a, cuDoubleComplex* b, cuDoubleComplex* c) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    int stride=gridDim.x*blockDim.x;
    if(tid>=N_MATRIX) return;
    cuDoubleComplex a_sub[4], b_sub[4], c_sub[4];

    for(int i=0; i<4; i++){
        a_sub[i]=a[tid+stride*i];
        b_sub[i]=b[tid+stride*i];  // read from b, not a
        c_sub[i].x=0.0;
        c_sub[i].y=0.0;
    }

    D_C_Mult_Add(a_sub[0], b_sub[0], c_sub[0]);
    D_C_Mult_Add(a_sub[0], b_sub[1], c_sub[1]);
    D_C_Mult_Add(a_sub[2], b_sub[0], c_sub[2]);
    D_C_Mult_Add(a_sub[2], b_sub[1], c_sub[3]);

    D_C_Mult_Add(a_sub[1], b_sub[2], c_sub[0]);
    D_C_Mult_Add(a_sub[1], b_sub[3], c_sub[1]);
    D_C_Mult_Add(a_sub[3], b_sub[2], c_sub[2]);
    D_C_Mult_Add(a_sub[3], b_sub[3], c_sub[3]);

    for(int i=0; i<4; i++)
        c[tid+stride*i]=c_sub[i];

}

int main(int argc, char **argv) {

    cuDoubleComplex *a, *b, *c;
    cudaMalloc(&a, sizeof(cuDoubleComplex)*N_MATRIX*4);
    cudaMalloc(&b, sizeof(cuDoubleComplex)*N_MATRIX*4);
    cudaMalloc(&c, sizeof(cuDoubleComplex)*N_MATRIX*4);

    multMatrix<<<N_MATRIX/128+1, 128>>>(a, b, c);

    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
-2
votes

I have adjusted my kernel using Christian Sarofeen's code. His kernel needs 60 ms to calculate 9.1 million matrix multiplications on a GTX 970.

My adjusted kernel can do it in 31 ms. Thank you Christian!

__global__ void bulkMatrixMul(const cuDoubleComplex *a, int lda, const cuDoubleComplex *b, int ldb, cuDoubleComplex *c, int ldc){
    int pos = (blockDim.z * blockIdx.x + threadIdx.z);
    int ia = pos * lda * lda + threadIdx.x;
    int ib = (pos * ldb + threadIdx.y) * ldb;
    cuDoubleComplex cR;
    cR.x = a[ia].x * b[ib].x - a[ia].y * b[ib].y - a[ia + lda].y * b[ib + 1].y + a[ia + lda].x * b[ib + 1].x;
    cR.y = a[ia].x * b[ib].y + a[ia].y * b[ib].x + a[ia + lda].x * b[ib + 1].y + a[ia + lda].y * b[ib + 1].x;
    c[(pos * ldc + threadIdx.y) * ldc + threadIdx.x] = cR;
}


void bulkBatchedMatrixMul(complex<double> *a, int lda, complex<double> *b, int ldb, complex<double> *c, int ldc, int batchSize, cudaStream_t *stream){
    dim3 threads(2, 2, 16);
    bulkMatrixMul<<<batchSize / 16, threads, 0, *stream>>>((cuDoubleComplex*)a, lda, (cuDoubleComplex*)b, ldb, (cuDoubleComplex*)c, ldc);
    if (cudaSuccess != cudaGetLastError())
        printf("Error!\n");
}