CuSparse/CuBlas K40 vs GTX Titan X (Maxwell)

Question

I am using both a Tesla K40 and a GTX Titan X, with CUDA 8.0. The functions I use are cuBLAS and cuSPARSE library functions: cusparseDcsrsv2_solve(), cusparseDcsrmv(), and cublasDdot().

Why is the GTX Titan X faster than the K40? I compile with nvcc using flags for all compute capabilities from 3.0 to 6.0, and my program uses 9 GB of the 12 GB of RAM. My guess is that the library functions don't use double precision, because in single precision the GTX Titan X reaches 6.xx TFLOPS and the K40 4.xx TFLOPS, while in double precision the GTX Titan X reaches 2xx GFLOPS and the K40 1.xx TFLOPS. In theory the K40 should be faster than the GTX Titan X. What could my problem be? It's so strange.

Try cublasDgemm. It will be faster on the K40. The other functions you list may very well be memory-bandwidth bound, not limited by compute throughput. This is generally the case with sparse matrix operations. cublasDdot doesn't have enough compute intensity to make a difference: it is still bandwidth bound. — Robert Crovella
I didn't describe the workstations in my department. Both workstations have the same RAM, CPU, HDD, etc. The difference is that one has two GPUs (nvidia-smi output): GPU 0 is a GT 750 used for X, and GPU 1 is the Tesla K40. But in CUDA code the Tesla K40's ID for cudaSetDevice is 0. The other workstation has only the GTX Titan X. I thought the difference was the memory bandwidth, because the Titan has DDR5 RAM while the K40 has DDR3 RAM. But do you think my first configuration may be installed wrongly? — Alessandro D'Auria
Yes, if the codes are running on the GT 750 instead of the Tesla K40, then you'll certainly be disappointed. You can use the CUDA_VISIBLE_DEVICES environment variable to force the codes to run on the K40 in that machine. Experiment with that variable and the deviceQuery sample code to make sure you understand the behavior, then re-run your actual codes. I don't really think that your codes are running on the GT 750 (my sense is they should not), but it's worth a test. — Robert Crovella

einpoklum · Accepted Answer · 2017-03-19T14:45:13

First of all, the answer to these questions is typically: Profile your kernels and you'll learn what exactly runs slower.

I will say, though, it's not true that a K40 is supposed to be faster than a Maxwell Titan X:

Clock speed: Titan X: 1000 MHz, Tesla K40: 745 MHz.
Memory bandwidth: Titan X: 336 GB/sec, Tesla K40: 288 GB/sec.
Number of "CUDA cores" (i.e. maximum simultaneously-executing lanes in multiprocessor vectorized registers): Titan X: 3072, Tesla K40: 2880.

so the Titan X has a bunch of stats working in its favor, not to mention the fact that it's a different microarchitecture, which can always mix things up performance-wise even with the same 'raw' statistics. Thus at least for some workloads, the Titan X should be faster.
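As a rough cross-check, the clock and core figures can be combined into a peak single-precision estimate. This is only a sketch: it assumes base clocks, one fused multiply-add per core per cycle counted as 2 FLOPs, and NVIDIA's published figure of 2880 cores for the K40. The function name `peak_sp_gflops` is made up for illustration.

```python
# Peak single-precision throughput estimated as cores x clock x 2,
# counting one fused multiply-add per core per cycle as 2 FLOPs.
# Assumption: base clocks, and 2880 CUDA cores for the K40.
GPUS = {
    "GTX Titan X (Maxwell)": {"clock_mhz": 1000, "cuda_cores": 3072},
    "Tesla K40": {"clock_mhz": 745, "cuda_cores": 2880},
}


def peak_sp_gflops(clock_mhz, cuda_cores):
    """Return an idealized single-precision peak in GFLOPS."""
    return cuda_cores * clock_mhz * 2 / 1000.0


for name, spec in GPUS.items():
    gflops = peak_sp_gflops(spec["clock_mhz"], spec["cuda_cores"])
    print(f"{name}: about {gflops:.0f} GFLOPS single precision")
```

The results land in the same range as the "6.xx" and "4.xx" TFLOPS figures mentioned in the question, so in single precision the Titan X is ahead on paper as well.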

Of course, as others suggest, for double-precision floating-point performance proper, the K40 should best the Titan X easily: the K40 has silicon for about 1430 GFLOPS in double precision, and the Titan X only for about 192 (!)
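To see why a bandwidth-bound routine can still favor the Titan X despite that gap, here is a rough roofline-style estimate for a double-precision dot product such as cublasDdot. It is a back-of-the-envelope sketch, not a measurement: it uses the bandwidth and double-precision peaks quoted in this answer, and assumes each element costs 2 FLOPs and two 8-byte loads, ignoring caches and launch overhead. The helper name `attainable_gflops` is hypothetical.

```python
# Rough roofline estimate for a double-precision dot product.
# Bandwidth and DP peak figures are the ones quoted in the answer;
# the per-element traffic model is a simplifying assumption.
GPUS = {
    "GTX Titan X (Maxwell)": {"bandwidth_gbs": 336.0, "peak_dp_gflops": 192.0},
    "Tesla K40": {"bandwidth_gbs": 288.0, "peak_dp_gflops": 1430.0},
}

# Per element: one multiply and one add (2 FLOPs), two 8-byte loads (16 bytes).
DDOT_FLOPS_PER_BYTE = 2 / 16


def attainable_gflops(bandwidth_gbs, peak_dp_gflops, flops_per_byte):
    """Roofline model: the lower of the compute limit and the memory limit."""
    return min(peak_dp_gflops, bandwidth_gbs * flops_per_byte)


for name, spec in GPUS.items():
    bound = attainable_gflops(
        spec["bandwidth_gbs"], spec["peak_dp_gflops"], DDOT_FLOPS_PER_BYTE
    )
    limiter = "bandwidth" if bound < spec["peak_dp_gflops"] else "compute"
    print(f"{name}: at most about {bound:.0f} GFLOPS, limited by {limiter}")
```

Under these assumptions both cards are limited by memory bandwidth, far below their double-precision peaks, and the card with the higher bandwidth comes out ahead. Profiling the actual kernels remains the way to confirm this.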
