I have seen poor performance from the cuBLAS API when using small matrices, especially with batched general matrix multiplication. My goal is to write a fast kernel that calculates batched complex double matrix multiplications for a matrix size of 2x2. The kernel should also be able to extract these 2x2 matrices from larger square matrices, i.e. perform block matrix multiplications.
The kernel I have written calculates 8 2x2 double-complex matrix multiplications per block (32 threads). According to NSight, I achieve about 1.65 double-precision GFLOPS on a GTX 970 (compute capability 5.2), whose theoretical double-precision peak is about 109 GFLOPS. It takes about 300 ms to calculate 9.1 million matrix multiplications.
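As a sanity check on those numbers: each of the 4 output elements of a 2x2 complex product needs 2 complex multiplications (6 flops each) and 1 complex addition (2 flops), so one 2x2 product costs 4 × 14 = 56 flops, and 9.1e6 × 56 flops / 0.3 s ≈ 1.7 GFLOPS, which matches the profiler figure.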
`a`, `b` and `c` are pointers to arrays of N n×n matrices; these matrices must be square. `lda`, `ldb` and `ldc` are the leading dimensions of the input matrices. Using these parameters, the kernel can extract 2x2 matrices from any array of square matrices.
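To make the addressing concrete: the kernel assumes the matrices are stored back to back in column-major order. As a host-side illustration (the helper `matrixElement` is just for exposition, not part of the kernel):

#include <cstddef>

// Element (row, col) of matrix `pos` in an array of column-major
// ld x ld matrices stored back to back: matrix `pos` starts at offset
// pos * ld * ld, and column `col` starts col * ld elements into it.
size_t matrixElement(int pos, int ld, int row, int col) {
    return (size_t)pos * ld * ld + (size_t)col * ld + row;
}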
Question: How do I increase the performance of this kernel?
#include <complex>
#include <cstdio>
#include <cuComplex.h>
using std::complex;

#define THREADMULTIPLIER 8

__global__ void bulkMatrixMul(const cuDoubleComplex *a, int lda, const cuDoubleComplex *b, int ldb, cuDoubleComplex *c, int ldc) {
    int pos = THREADMULTIPLIER * blockIdx.x + threadIdx.z; // which 2x2 multiplication
    int ia = pos * lda * lda + threadIdx.x;                // A(x, 0) of matrix pos
    int ib = (pos * ldb + threadIdx.y) * ldb;              // B(0, y) of matrix pos
    // C(x, y) = A(x, 0) * B(0, y) + A(x, 1) * B(1, y)
    c[(pos * ldc + threadIdx.y) * ldc + threadIdx.x] = cuCfma(a[ia], b[ib], cuCmul(a[ia + lda], b[ib + 1]));
}
void bulkBatchedMatrixMul(complex<double> *a, int lda, complex<double> *b, int ldb, complex<double> *c, int ldc, int batchSize, cudaStream_t *stream) {
    dim3 threads(2, 2, THREADMULTIPLIER); // 2 x 2 threads per matrix, THREADMULTIPLIER matrices per block
    bulkMatrixMul<<<batchSize / THREADMULTIPLIER, threads, 0, *stream>>>((cuDoubleComplex*)a, lda, (cuDoubleComplex*)b, ldb, (cuDoubleComplex*)c, ldc);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Error: %s\n", cudaGetErrorString(err));
}
I have made my kernel 10 times faster by analysing Christian Sarofeen's code:
- it is faster to write each sum as a single whole expression than to accumulate it with += and -= (see the sketch after this list);
- declaring a local output variable, and in particular splitting the calculation of its real and imaginary parts, increases speed.
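A minimal sketch of the difference (illustrative `__device__` helpers, not the actual kernel; `a0`, `a1`, `b0`, `b1` stand for the operands of one row-times-column product):

#include <cuComplex.h>

// Slower: accumulate through the complex type, term by term.
__device__ cuDoubleComplex dotSlow(cuDoubleComplex a0, cuDoubleComplex a1,
                                   cuDoubleComplex b0, cuDoubleComplex b1) {
    cuDoubleComplex acc = cuCmul(a0, b0);
    acc = cuCfma(a1, b1, acc); // acc += a1 * b1
    return acc;
}

// Faster in my tests: declare the output and write its real and
// imaginary parts each as one whole sum.
__device__ cuDoubleComplex dotFast(cuDoubleComplex a0, cuDoubleComplex a1,
                                   cuDoubleComplex b0, cuDoubleComplex b1) {
    cuDoubleComplex out;
    out.x = a0.x * b0.x - a0.y * b0.y + a1.x * b1.x - a1.y * b1.y;
    out.y = a0.x * b0.y + a0.y * b0.x + a1.x * b1.y + a1.y * b1.x;
    return out;
}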
The kernel below is an optimised version calculating 16 matrix multiplications per block. It can calculate 9.1 million matrix multiplications in 31 ms: I use a batchSize of 10000 and call the kernel 70 times per stream across 13 streams (a launch sketch follows the code). According to NSight it achieves about 15 double-precision GFLOPS. A GTX 970 was used.
__global__ void bulkMatrixMul(const cuDoubleComplex *a, int lda, const cuDoubleComplex *b, int ldb, cuDoubleComplex *c, int ldc) {
    int pos = blockDim.z * blockIdx.x + threadIdx.z; // which 2x2 multiplication
    int ia = pos * lda * lda + threadIdx.x;          // A(x, 0) of matrix pos
    int ib = (pos * ldb + threadIdx.y) * ldb;        // B(0, y) of matrix pos
    // C(x, y) = A(x, 0) * B(0, y) + A(x, 1) * B(1, y),
    // with the real and imaginary parts each written out as one whole sum.
    cuDoubleComplex cR;
    cR.x = a[ia].x * b[ib].x - a[ia].y * b[ib].y - a[ia + lda].y * b[ib + 1].y + a[ia + lda].x * b[ib + 1].x;
    cR.y = a[ia].x * b[ib].y + a[ia].y * b[ib].x + a[ia + lda].x * b[ib + 1].y + a[ia + lda].y * b[ib + 1].x;
    c[(pos * ldc + threadIdx.y) * ldc + threadIdx.x] = cR;
}
void bulkBatchedMatrixMul(complex<double> *a, int lda, complex<double> *b, int ldb, complex<double> *c, int ldc, int batchSize, cudaStream_t *stream) {
    dim3 threads(2, 2, 16); // 2 x 2 threads per matrix, 16 matrices per block
    bulkMatrixMul<<<batchSize / 16, threads, 0, *stream>>>((cuDoubleComplex*)a, lda, (cuDoubleComplex*)b, ldb, (cuDoubleComplex*)c, ldc);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Error: %s\n", cudaGetErrorString(err));
}
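For completeness, this is roughly how I drive the kernel; a sketch only, assuming lda = ldb = ldc = 2 and densely packed matrices, with the slicing and the name `launchAll` purely illustrative:

#include <complex>
#include <cuda_runtime.h>
using std::complex;

// 13 streams, each issuing 70 launches of batchSize 10000
// (13 * 70 * 10000 = 9.1 million multiplications). d_a, d_b and d_c
// are assumed to be device buffers already holding the input matrices.
void launchAll(complex<double> *d_a, complex<double> *d_b, complex<double> *d_c) {
    const int nStreams = 13, launchesPerStream = 70, batchSize = 10000;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);
    for (int s = 0; s < nStreams; ++s) {
        for (int l = 0; l < launchesPerStream; ++l) {
            // Each launch works on its own slice; 4 elements per 2x2 matrix.
            size_t offset = ((size_t)s * launchesPerStream + l) * batchSize * 4;
            bulkBatchedMatrixMul(d_a + offset, 2, d_b + offset, 2, d_c + offset, 2, batchSize, &streams[s]);
        }
    }
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}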