Problem type: regression
Inputs: variable-length sequences (14 to 39 time steps), where each time step is a 4-element vector.
Output: a scalar
Neural Network: 3-layer Bi-LSTM (hidden vector size: 200) followed by 2 fully connected layers (a minimal sketch of this model follows the list below).
Batch Size: 30
Number of samples per epoch: ~7,000
TensorFlow version: tf-nightly-gpu 1.6.0-dev20180112
CUDA version: 9.0
CuDNN version: 7
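For context, here is a minimal sketch of the model described above, written against the TF 1.x Python API used in these tests. Only the layer count, hidden size, and input/output shapes come from the description; the FC layer width, the use of the last valid time step's output, and the optimizer/learning rate are assumptions for illustration.

```python
import tensorflow as tf

NUM_FEATURES = 4     # 4-element vector per time step
HIDDEN_SIZE = 200    # Bi-LSTM hidden vector size
NUM_LAYERS = 3       # 3-layer Bi-LSTM
FC_UNITS = 100       # width of the first FC layer (assumed, not stated above)

# Padded input batch [batch, max_time, 4] plus the true length of each sequence (14..39).
inputs = tf.placeholder(tf.float32, [None, None, NUM_FEATURES], name="inputs")
seq_len = tf.placeholder(tf.int32, [None], name="seq_len")
targets = tf.placeholder(tf.float32, [None], name="targets")   # one scalar per sample

# Stacked bidirectional LSTM over the padded sequences.
cells_fw = [tf.nn.rnn_cell.LSTMCell(HIDDEN_SIZE) for _ in range(NUM_LAYERS)]
cells_bw = [tf.nn.rnn_cell.LSTMCell(HIDDEN_SIZE) for _ in range(NUM_LAYERS)]
outputs, _, _ = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    cells_fw, cells_bw, inputs, sequence_length=seq_len, dtype=tf.float32)

# Take the output at the last valid time step of each sequence.
batch_idx = tf.range(tf.shape(outputs)[0])
last_step = tf.gather_nd(outputs, tf.stack([batch_idx, seq_len - 1], axis=1))

# Two fully connected layers ending in a single scalar prediction.
hidden = tf.layers.dense(last_step, FC_UNITS, activation=tf.nn.relu)
prediction = tf.squeeze(tf.layers.dense(hidden, 1), axis=1)

loss = tf.losses.mean_squared_error(labels=targets, predictions=prediction)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```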
Details of the two GPUs:
GPU 0: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 totalMemory: 11.00GiB freeMemory: 10.72GiB
nvidia-smi during the run (using 1080 Ti only):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69 Driver Version: 385.69 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:02:00.0 Off | N/A |
| 20% 37C P2 58W / 250W | 10750MiB / 11264MiB | 10% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K1200 WDDM | 00000000:03:00.0 On | N/A |
| 39% 35C P8 1W / 31W | 751MiB / 4096MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
GPU 1: name: Quadro K1200 major: 5 minor: 0 memoryClockRate(GHz): 1.0325 totalMemory: 4.00GiB freeMemory: 3.44GiB
nvidia-smi during the run (using K1200 only):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69 Driver Version: 385.69 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:02:00.0 Off | N/A |
| 20% 29C P8 8W / 250W | 136MiB / 11264MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K1200 WDDM | 00000000:03:00.0 On | N/A |
| 39% 42C P0 6W / 31W | 3689MiB / 4096MiB | 23% Default |
+-------------------------------+----------------------+----------------------+
Time spent for 1 epoch:
GPU 0 only (environment variable CUDA_VISIBLE_DEVICES=0): ~60 minutes
GPU 1 only (environment variable CUDA_VISIBLE_DEVICES=1): ~45 minutes
TF_MIN_GPU_MULTIPROCESSOR_COUNT=4 was set for both tests so that TensorFlow would not ignore the K1200, which has fewer multiprocessors than the default threshold; a minimal setup sketch follows.
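A sketch of one way to set up each run, assuming the environment variables are set from Python (exporting them in the shell before launching works the same way):

```python
import os

# Choose which physical GPU TensorFlow may see; must be set before TensorFlow
# initializes CUDA (i.e. before the first Session touches the GPU).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"            # "0" = GTX 1080 Ti, "1" = Quadro K1200
# Lower TensorFlow's multiprocessor threshold so the small K1200 is not skipped.
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = "4"

import tensorflow as tf  # import only after the environment variables are set

# Optional: log device placement to confirm which device the ops actually run on.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
```

Note that when CUDA_VISIBLE_DEVICES is set to a single index, the selected card always appears as /device:GPU:0 inside TensorFlow.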
Why is the better GPU (GeForce GTX 1080 Ti) slower at training my neural network?
Thanks in advance.
Update
Another set of tests on the MNIST dataset using a CNN model showed the same pattern:
Time to train 17 epochs:
GPU 0 (1080 Ti): ~59 minutes
GPU 1 (K1200): ~45 minutes
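Assuming the epoch times above are wall-clock measurements, here is a minimal, illustrative way such timings could be collected around a hand-written training loop (the helper is hypothetical):

```python
import time
from contextlib import contextmanager

@contextmanager
def epoch_timer(epoch):
    """Print the wall-clock time of one training epoch."""
    start = time.time()
    yield
    print("epoch %d took %.1f minutes" % (epoch, (time.time() - start) / 60.0))

# Usage inside a training loop (the loop body itself is omitted here):
# for epoch in range(17):
#     with epoch_timer(epoch):
#         ...  # run all training batches for this epoch
```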