These are my results from running cuBLAS DGEMM on 4 GPUs (Tesla M2050), using 2 streams per GPU:

I have verified that the computed results are correct. My concern is the high Gflops value I am getting compared with the versions that use the default stream. I am calculating the Gflops using the formula:
Gflops = 2.0 * 10^-9 * (N^3 + N^2) / elapsed_time_in_s
For the version that uses multiple streams, do I need to modify this formula in any way?
HtoD-ker-DtoH is the total time, in seconds, taken for the host-to-device data transfer, kernel execution, and device-to-host data transfer; it is the denominator of the formula above.
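To make the calculation concrete, here is a minimal C sketch of the formula exactly as I wrote it above. The function names `dgemm_flops` and `dgemm_gflops` are just illustrative helpers, not part of cuBLAS, and the sketch assumes square N x N matrices. Note that the operation count depends only on N, not on how the work is split across streams or GPUs, so the value is only meaningful if `elapsed_s` covers all of the work from start to completion.

```c
#include <assert.h>

/* Operation count assumed by the formula in this post for an
   N x N DGEMM: 2 * (N^3 + N^2). Hypothetical helper name. */
double dgemm_flops(double n)
{
    return 2.0 * (n * n * n + n * n);
}

/* Gflops = operation count * 1e-9 / elapsed time in seconds.
   elapsed_s is the HtoD-ker-DtoH time and must be positive. */
double dgemm_gflops(double n, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return dgemm_flops(n) * 1.0e-9 / elapsed_s;
}
```

For example, `dgemm_gflops(1000.0, 0.5)` evaluates the formula for N = 1000 and a placeholder elapsed time of 0.5 s; these inputs are made up for illustration and are not measurements from my runs.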
Cross-posted to the NVIDIA forums: http://forums.nvidia.com/index.php?showtopic=219910&st=0#entry1350908
EDIT: Following the comment of @talonmies, I added a cudaStreamSynchronize before calculating the time, and the results are as follows:

Thanks,
Sayan
cublasdgemm on the chunks (on each GPU)... - Sayan