I have an ASUS N552VW laptop with a dedicated 4 GB GeForce GTX 960M graphics card. I put these lines at the beginning of my code to compare training speed on the GPU versus the CPU, and it seems the CPU wins!

For GPU:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

For CPU:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

I have installed CUDA, cuDNN, tensorflow-gpu, etc. to speed up training, but the opposite seems to have happened!
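
As a sanity check, here is a minimal sketch (assuming TensorFlow 1.x, which matches the logs below) that prints the devices TensorFlow can actually see, so GPU detection can be confirmed before training:

# Sketch: list the local devices visible to TensorFlow 1.x
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())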

When I run the first version, it prints the following (before execution starts):

Train on 2128 samples, validate on 22 samples
Epoch 1/1
2019-08-02 18:49:41.828287: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-08-02 18:49:42.457662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 4.00GiB freeMemory: 3.34GiB
2019-08-02 18:49:42.458819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-08-02 18:49:43.776498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-02 18:49:43.777007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-08-02 18:49:43.777385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-08-02 18:49:43.777855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3050 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
2019-08-02 18:49:51.834610: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally

And it's really slow [Finished in 263.2s]. But when I run the second version, it prints:

Train on 2128 samples, validate on 22 samples
Epoch 1/1
2019-08-02 18:51:43.021867: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-08-02 18:51:43.641123: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-08-02 18:51:43.645072: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: DESKTOP-UQ8B9FK
2019-08-02 18:51:43.645818: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: DESKTOP-UQ8B9FK

And it's much faster than the first one [Finished in 104.7s]! How is this possible?

EDIT: This is the part of the code that is related to TensorFlow:

# un: LSTM units, dp: dropout rate, rp: RepeatVector length, ds: Dense units
from keras.models import Sequential
from keras.layers import LSTM, Dropout, RepeatVector, TimeDistributed, Dense

model = Sequential()
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=False))
model.add(Dropout(dp))

model.add(RepeatVector(rp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))

model.add(TimeDistributed(Dense(ds)))

What model are you training? Please add code for it. – Dr. Snoopy
@MatiasValdenegro: The code is big and private, but I use the LSTM model of the Keras library. I also tried different batch_sizes, from 5 to 50, but it made no difference. – ensan3kamel
Sure, but then just describe the model: how many layers? How many parameters? – Dr. Snoopy
@MatiasValdenegro: I edited my question. – ensan3kamel

1 Answer

There are two relevant issues here:

  • A model needs to be "big enough" to benefit from GPU acceleration: the training data has to be transferred to the GPU and the updated weights have to be copied back, and for small models this overhead can outweigh the speedup, making training slower overall.
  • Recurrent layers are hard to parallelize, since they involve a lot of sequential computation across timesteps. You might consider using the CuDNNLSTM layer instead of the plain LSTM, as it is optimized for the GPU (see the sketch at the end of this answer).

In general, training a small model on the GPU might not be faster than training it on the CPU.
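
As a minimal sketch of that swap (assuming TensorFlow 1.x with the standalone Keras package, and placeholder values for the hyperparameters un, dp, rp, ds and for the input shape), the model could use CuDNNLSTM like this:

# Sketch: CuDNNLSTM as a drop-in replacement for LSTM (requires a GPU; Keras 2.x on TF 1.x)
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dropout, RepeatVector, TimeDistributed, Dense

un, dp, rp, ds = 64, 0.2, 10, 1      # placeholder values: units, dropout rate, repeat length, output size
timesteps, features = 30, 1          # placeholder input shape

model = Sequential()
model.add(CuDNNLSTM(un, return_sequences=True, input_shape=(timesteps, features)))
model.add(Dropout(dp))
model.add(CuDNNLSTM(un, return_sequences=False))
model.add(Dropout(dp))
model.add(RepeatVector(rp))
model.add(CuDNNLSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(TimeDistributed(Dense(ds)))
model.compile(loss='mse', optimizer='adam')   # loss/optimizer are placeholders as well

Note that CuDNNLSTM only runs on a GPU and uses the fixed cuDNN activations, so it cannot be used together with CUDA_VISIBLE_DEVICES = '-1'.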