
I'm using the pre-built AI Platform Jupyter Notebook instances to train a model with a single Tesla K80 card. The issue is that I don't believe the model is actually training on the GPU.

nvidia-smi returns the following during training:

No Running Processes Found

Not the "No Running Process Found" yet "Volatile GPU Usage" is 100%. Something seems strange...

...And the training is excruciatingly slow.

A few days ago, I was having issues with the GPU memory not being released after each notebook run. When this occurred I would receive an OOM (out-of-memory) error. This required me to go into the console every time, find the PID of the process holding the GPU, and use kill -9 before re-running the notebook. However, today I can't get the GPU to run at all; it never shows a running process.
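
For context, the manual cleanup I was doing was roughly equivalent to the sketch below (gpu_process_pids is just an illustrative helper name; in practice I ran nvidia-smi and kill -9 by hand in the console, and this assumes the driver's nvidia-smi supports --query-compute-apps):

import subprocess

# Ask nvidia-smi which processes are holding the GPU so their PIDs can be
# killed before re-running the notebook.
def gpu_process_pids():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        universal_newlines=True)
    pids = []
    for line in out.strip().splitlines():
        pid, name, mem = [field.strip() for field in line.split(",")]
        print("PID %s: %s (%s)" % (pid, name, mem))
        pids.append(int(pid))
    return pids

print(gpu_process_pids())  # each PID was then killed with `kill -9 <pid>` in the console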

I've tried 2 different GCP AI Platform Notebook instances (both of the available TensorFlow version options) with no luck. Am I missing something with these "pre-built" instances?

Pre-Built AI Platform Notebook Section

Just to clarify, I did not build my own instance and then set up Jupyter notebook access myself. Instead, I used the built-in Notebook instance option under the AI Platform submenu.

Do I still need to configure a setting somewhere or install a library to keep using/reset my chosen GPU? I was under the impression that the virtual machine already came loaded with the Nvidia stack and should be plug-and-play with GPUs.
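
For what it's worth, this is the kind of check I would expect to confirm the setup from inside the notebook (a minimal sketch; I believe tf.test.is_built_with_cuda and tf.config.experimental.list_physical_devices are the relevant TF 2.0 APIs, but I may be missing a required configuration step):

import tensorflow as tf

# Is this TensorFlow build compiled against CUDA (i.e. the GPU build)?
print("Built with CUDA:", tf.test.is_built_with_cuda())

# Which physical GPUs can TensorFlow actually see on this VM?
print("Visible GPUs:", tf.config.experimental.list_physical_devices('GPU'))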

Thoughts?

EDIT: Here is a full video of the issue as requested --> https://www.youtube.com/watch?v=N5Zx_ZrrtKE&feature=youtu.be

Can you please list the exact steps needed to reproduce this issue? (e.g. 1. Go here 2. Click that 3. Enter this 4. ...etc. You could even add a video of yourself creating a new notebook and showing this error.) The ideal level of detail would be something that would let a 12-year-old with a GCP account reproduce your problem on their own account. – Zain Rizvi
Did you try pip uninstall tensorflow + pip3 uninstall tensorflow and then pip3 install tensorflow-gpu? – razimbres
@ZainRizvi I will try to put together a video, but basically the steps are... 1. Create a notebook instance using the AI Platform menu option with TensorFlow 2.0, a 100 GB HDD, and a single Tesla K80 GPU card. 2. Make sure the checkbox to install the Nvidia drivers is checked. 3. Turn on the instance and open the notebook for the first time. 4. Upload my dataset from my local HDD via the built-in Jupyter notebook upload option. 5. Write my model/training code. – Chase Brumfield
6. Train my model. 7. The model trains just as slowly as if I were running it on my MacBook Air CPU. 8. Running nvidia-smi during model training still shows "No Running Processes". – Chase Brumfield
What code are you using to train? github.com/aymericdamien/TensorFlow-Examples/blob/master/… Can you start with simple code like this? – gogasca

1 Answer


Generally speaking, you'll want to try to debug issues like this using the smallest possible bit of code that could reproduce your error. That removes many possible causes for the issue you're seeing.

In this case, you can check if your GPUs are being used by running this code (copied from the TensorFlow 2.0 GPU instructions):

import tensorflow as tf
print("GPU Available: ", tf.test.is_gpu_available())

# Log which device each op gets placed on
tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

Running it on the same TF 2.0 Notebook gives me the output:

GPU Available:  True
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

That right there shows that it's using the GPU.

Similarly, if you need more evidence, running nvidia-smi gives the output:

jupyter@tf2:~$ nvidia-smi
Tue Jul 30 00:59:58 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    58W / 149W |  10900MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7852      C   /usr/bin/python3                           10887MiB |
+-----------------------------------------------------------------------------+

So why isn't your code using GPUs? You're using a library someone else wrote, probably for tutorial purposes. Most likely those library functions are doing something that is causing CPUs to be used instead of GPUs.

You'll want to debug that code directly.
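
If you want to see where that code's ops actually land without rewriting it, one option is to turn on device placement logging and pin the work to the GPU. The snippet below is just a sketch: the tiny Keras model on random data stands in for whatever training call the tutorial library exposes.

import numpy as np
import tensorflow as tf

# Log the device every op gets placed on; any training code run after this
# will print its placements too.
tf.debugging.set_log_device_placement(True)

# Stand-in training job, only here to demonstrate the placement log.
# Swap in the library's training call instead.
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Pin the work to the first GPU so the log makes it obvious where ops run.
with tf.device('/GPU:0'):
    model.fit(x, y, epochs=1, batch_size=32, verbose=2)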