I am using both a Tesla K40 and a GTX Titan X with CUDA 8.0. The functions I use come from the cuBLAS and cuSPARSE libraries: cusparseDcsrsv2_solve(), cusparseDcsrmv(), and cublasDdot().
Why is the GTX Titan X faster than the K40? I compile with nvcc using flags for every compute capability from 3.0 to 6.0, and my program uses 9 GB of the 12 GB of RAM. My guess is that the library functions are not using double precision: in single precision the GTX Titan X delivers 6.xx TFLOPS and the K40 4.xx TFLOPS, while in double precision the GTX Titan X delivers 2xx GFLOPS and the K40 1.xx TFLOPS. In theory the K40 should be faster than the GTX Titan X. What could my problem be? It's so strange.
deviceQuery sample code to make sure you understand the behavior, then re-run your actual codes. I don't really think that your codes are running on the GT 750 (my sense is they should not) but it's worth a test. – Robert Crovella
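For anyone who wants to run the kind of check suggested in the comment without building the full deviceQuery sample, a minimal sketch using the CUDA runtime API might look like the following. It only lists the devices the runtime can see and reports which one is current. The optional command-line argument for choosing a device and the `check` helper are assumptions made for illustration; they are not part of the original program or of the deviceQuery sample.

```cuda
// Minimal sketch: list visible CUDA devices and show which one is current.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main(int argc, char **argv) {
    int count = 0;
    check(cudaGetDeviceCount(&count), "cudaGetDeviceCount");

    // Print name, compute capability and memory size of every visible device.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        check(cudaGetDeviceProperties(&prop, i), "cudaGetDeviceProperties");
        std::printf("device %d: %s, compute capability %d.%d, %.1f GiB\n",
                    i, prop.name, prop.major, prop.minor,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }

    // Assumed convention for this sketch: an optional first argument
    // selects the device index to make current.
    if (argc > 1) {
        check(cudaSetDevice(std::atoi(argv[1])), "cudaSetDevice");
    }

    // Report the device that subsequent CUDA and library calls would use.
    int current = -1;
    check(cudaGetDevice(&current), "cudaGetDevice");
    std::printf("current device: %d\n", current);
    return 0;
}
```

Built with nvcc and run once with no argument and once per device index, this should show which GPU each timing run actually lands on. Note that the CUDA_VISIBLE_DEVICES environment variable can hide or reorder devices, so the indices printed here may not match those reported by other tools.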