CUBLAS dgemm performance query

Question

These are my results from running cublas DGEMM on 4 GPUs (Tesla M2050), using 2 streams per GPU:

[Image: results table, not reproduced here]

I have verified that the results are correct, but I am concerned about the high Gflops value I am getting compared with the version that uses the default stream. I calculate the Gflops using the formula:

Gflops = {2.0*10^-9*(N^3+N^2)}/elapsed_time_in_s
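
For reference, the formula above can be written as a small host-side helper. This is only a sketch: the function name `dgemm_gflops` and its parameter names are made up for illustration, and the operation count is taken directly from the formula as posted.

```cuda
// Host-side helper; compiles as plain C++ or inside a .cu file.
// Implements the formula from the question:
//   Gflops = 2.0e-9 * (N^3 + N^2) / elapsed_time_in_s
// `n` is the matrix dimension N, `elapsed_s` is the measured wall time in
// seconds (here: host-to-device copy + kernel + device-to-host copy).
double dgemm_gflops(double n, double elapsed_s) {
    const double flops = 2.0 * (n * n * n + n * n);  // operation count from the formula
    return flops * 1.0e-9 / elapsed_s;               // scale to Gflops
}
```

For example, `dgemm_gflops(1000.0, 0.01)` evaluates to about 200.2. Nothing in the helper depends on how many streams or GPUs did the work; only the measured `elapsed_s` changes.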

For the version that uses multiple streams, do I need to modify this formula in any way?

HtoD-ker-DtoH is the time, in seconds, taken for the host-to-device data transfer, the kernel execution, and the device-to-host data transfer (this is the denominator of the formula above).

Crossposted to the Nvidia forums: http://forums.nvidia.com/index.php?showtopic=219910&st=0#entry1350908

EDIT: Following the comment of @talonmies, I added a cudaStreamSynchronize before calculating the time, and the results are as follows:

[Image: results table after adding cudaStreamSynchronize, not reproduced here]

Thanks,

Sayan

What do you mean when you say "running on 4 GPUs", and what does that mean for the DGEMM operation? Are you splitting the DGEMM up over 4 devices or something else? — talonmies
I am splitting data in 4 parts for each GPU and then running cublasDgemm on the chunks (on each GPU)... — Sayan
A single C2050 gives about 550 GFLOP/s peak for double precision, or about 2200 GFLOP/s peak for 4 (and DGEMM is considerably lower than peak), so I would guess that your timing is wrong in the streams case (probably something that was synchronous in the default stream case is now asynchronous). The FLOP/s calculation should not change no matter how you do the computations. — talonmies
Thank you, I have added a cudaStreamSynchronize before I calculate time and I get reasonable results (added in EDIT). — Sayan

talonmies · Accepted Answer · 2013-11-10T11:58:20

A single C2050 gives about 550 GFLOP/s peak for double precision, or about 2200 GFLOP/s peak for 4 (and DGEMM is considerably lower than peak), so I would guess that your timing is wrong in the streams case (probably something that was synchronous in the default stream case is now asynchronous). The FLOP/s calculation should not change no matter how you do the computations.

I would review your code to ensure that whatever timing mechanism you use is synchronized with all the streams you launch, either via the cudaStreamWaitEvent mechanism across all streams or via cudaStreamSynchronize on each stream. It is likely that the timer is being stopped before the GPU has finished the CUBLAS operations.
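
As a rough illustration of the per-stream cudaStreamSynchronize approach, here is a minimal sketch. It is not the asker's code: `enqueue_work` and the `CHECK` macro are hypothetical names, and an asynchronous memset stands in for the real host-to-device copy, cublasDgemm call, and device-to-host copy. The point is only that the host clock is read after every stream on every device has been synchronized.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: abort on any CUDA runtime error.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Hypothetical stand-in for the real per-stream work (copies + DGEMM).
// It only enqueues an asynchronous memset and returns to the host at once.
static void enqueue_work(cudaStream_t stream, double* d_buf, size_t bytes) {
    CHECK(cudaMemsetAsync(d_buf, 0, bytes, stream));
}

int main() {
    int num_devices = 0;
    CHECK(cudaGetDeviceCount(&num_devices));
    const int streams_per_device = 2;
    const size_t bytes = size_t(64) << 20;  // 64 MiB per stream (arbitrary)

    std::vector<cudaStream_t> streams(num_devices * streams_per_device);
    std::vector<double*> buffers(streams.size(), nullptr);

    // Set-up is kept outside the timed region.
    for (int dev = 0; dev < num_devices; ++dev) {
        CHECK(cudaSetDevice(dev));
        for (int s = 0; s < streams_per_device; ++s) {
            const int i = dev * streams_per_device + s;
            CHECK(cudaStreamCreate(&streams[i]));
            CHECK(cudaMalloc(&buffers[i], bytes));
        }
    }

    const auto start = std::chrono::steady_clock::now();

    // Enqueue asynchronous work on every stream of every device.
    for (int dev = 0; dev < num_devices; ++dev) {
        CHECK(cudaSetDevice(dev));
        for (int s = 0; s < streams_per_device; ++s) {
            const int i = dev * streams_per_device + s;
            enqueue_work(streams[i], buffers[i], bytes);
        }
    }

    // Wait for every stream to finish BEFORE reading the clock again.
    for (int dev = 0; dev < num_devices; ++dev) {
        CHECK(cudaSetDevice(dev));
        for (int s = 0; s < streams_per_device; ++s) {
            CHECK(cudaStreamSynchronize(streams[dev * streams_per_device + s]));
        }
    }

    const auto stop = std::chrono::steady_clock::now();
    const double elapsed_s = std::chrono::duration<double>(stop - start).count();
    std::printf("elapsed: %f s on %d device(s)\n", elapsed_s, num_devices);

    // Clean up.
    for (int dev = 0; dev < num_devices; ++dev) {
        CHECK(cudaSetDevice(dev));
        for (int s = 0; s < streams_per_device; ++s) {
            const int i = dev * streams_per_device + s;
            CHECK(cudaFree(buffers[i]));
            CHECK(cudaStreamDestroy(streams[i]));
        }
    }
    return 0;
}
```

Without the synchronization loop, the second clock reading would measure little more than the time needed to enqueue the work, which would be consistent with the inflated Gflops figure reported in the question.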
