
Problem type: regression

Inputs: variable-length sequences (14 to 39 time steps); each time step is a 4-element vector.

Output: a scalar

Neural Network: 3-layer Bi-LSTM (hidden vector size: 200) followed by 2 Fully Connected layers

Batch Size: 30

Number of samples per epoch: ~7,000
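
For concreteness, here is a simplified sketch of the model structure using tf.keras (not the exact code I used; the Dense width of 64, the zero-padding/masking of the variable-length sequences, and the Adam/MSE training setup are just placeholders):

import tensorflow as tf

def build_model(max_len=39, n_features=4, hidden=200):
    # Variable-length sequences (14-39 steps, 4 features each) are assumed
    # to be zero-padded to max_len and masked.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(max_len, n_features)),
        # 3 stacked bidirectional LSTM layers, hidden vector size 200
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden)),
        # 2 fully connected layers ending in a scalar regression output
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model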

TensorFlow version: tf-nightly-gpu 1.6.0-dev20180112

CUDA version: 9.0

CuDNN version: 7

Details of the two GPUs:

GPU 0: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 totalMemory: 11.00GiB freeMemory: 10.72GiB

Device placement log: device_placement_log_0.txt

nvidia-smi during the run (using 1080 Ti only):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69                 Driver Version: 385.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:02:00.0 Off |                  N/A |
| 20%   37C    P2    58W / 250W |  10750MiB / 11264MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K1200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 39%   35C    P8     1W /  31W |    751MiB /  4096MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

GPU 1: name: Quadro K1200 major: 5 minor: 0 memoryClockRate(GHz): 1.0325 totalMemory: 4.00GiB freeMemory: 3.44GiB

Device placement log: device_placement_log_1.txt

nvidia-smi during the run (using K1200 only):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69                 Driver Version: 385.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:02:00.0 Off |                  N/A |
| 20%   29C    P8     8W / 250W |    136MiB / 11264MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K1200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 39%   42C    P0     6W /  31W |   3689MiB /  4096MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+

Time spent for 1 epoch:

GPU 0 only (environment variable CUDA_VISIBLE_DEVICES=0): ~60 minutes

GPU 1 only (environment variable CUDA_VISIBLE_DEVICES=1): ~45 minutes

The environment variable TF_MIN_GPU_MULTIPROCESSOR_COUNT=4 was set during both tests (see the sketch below).
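
For reference, a minimal sketch of how the device selection was expressed in each test (here the variables are set in Python before importing TensorFlow; setting them in the shell works the same way):

import os

# Restrict TensorFlow to one GPU and lower the multiprocessor-count threshold
# so the small Quadro K1200 is not skipped. Both must be set before
# TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # "0" for the 1080 Ti test, "1" for the K1200 test
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = "4"

import tensorflow as tf
print(tf.test.gpu_device_name())  # the visible card shows up as /device:GPU:0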

Why is the better GPU (GeForce GTX 1080 Ti) slower at training my neural network?

Thanks in advance.


Update

Another set of tests on the MNIST dataset using a CNN model showed the same pattern:

Time spent for training 17 epochs:

GPU 0 (1080 Ti): ~59 minutes

GPU 1 (K1200): ~45 minutes

Is either of the GPUs being used for graphics as well? – Alexandre Passos
@AlexandrePassos, yes, the Quadro K1200 was used for graphics (two monitors, resolutions: 1920x1200 and 1280x1024). The GeForce GTX 1080 Ti was not used for graphics or any activity other than training the model. – Maosi Chen
One of two options: (1) TF is deciding which GPU is 0 and which is 1 differently from nvidia (look in the TF startup logs to see what it decides), or (2) this particular model is faster on the CPU than on the GPU (TF by default won't run on the Quadro K1200 because there is not enough compute capacity on it). Can you log device placement to see? – Alexandre Passos
Can you log the op device placement to confirm that when you're using the slower GPU the GPU is actually being used? – Alexandre Passos
Can you show us what happens if you run nvidia-smi during the computation with both GPUs? – Alexandre Passos

1 Answer


The official TensorFlow documentation has a section, "Allowing GPU memory growth", that introduces two session options for controlling GPU memory allocation. I tried them separately when training my RNN model (using only the GeForce GTX 1080 Ti); a sketch of how each is set follows the list:

  • config.gpu_options.allow_growth = True
  • config.gpu_options.per_process_gpu_memory_fraction = 0.05
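
Both are fields of a tf.ConfigProto passed when creating the session; a minimal sketch (enable only one of the two options at a time):

import tensorflow as tf

config = tf.ConfigProto()
# Option 1: allocate GPU memory on demand instead of grabbing nearly all of it up front
config.gpu_options.allow_growth = True
# Option 2 (tested separately): cap the allocation at a fraction of total GPU memory
# config.gpu_options.per_process_gpu_memory_fraction = 0.05

with tf.Session(config=config) as sess:
    pass  # build the graph and run training here as usual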

Both of them shortened the training time from the original ~60 minutes per epoch to ~42 minutes per epoch. I still don't understand why this helps. If you can explain it, I will accept that as the answer. Thanks.