3
votes

I am currently trying to train a chatbot, more specifically this one. However, when I start training, it uses 100% of my CPU and only about 10% of my GPU. Does anyone have an idea why?

[Screenshots: GPU utilization, CPU utilization]

I have installed tensorflow-gpu and made sure I have the correct versions of CUDA and cuDNN. I have also made sure that the base tensorflow pip package is not installed, and I have the latest Nvidia drivers for my GPU. I have also tried uninstalling and reinstalling all my drivers, CUDA, cuDNN, tensorflow-gpu and all its dependencies, and Python itself, none of which worked.

I can create a Python script, include with tf.device('/gpu:0'): and create a graph with it without issue, so TensorFlow is definitely detecting the GPU; it just doesn't seem to utilize it.

When running sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) I get the following output:

2019-05-22 16:47:00.168170: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

2019-05-22 16:47:00.433514: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1105] Found device 0 with properties:

name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.48

pciBusID: 0000:01:00.0

totalMemory: 6.00GiB freeMemory: 4.97GiB

2019-05-22 16:47:00.450094: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)

Device mapping:

/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1

2019-05-22 16:47:01.391802: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\direct_session.cc:297] Device mapping:

/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1

1 Answer

2
votes

It doesn't look like there's any issue with your GPU setup, especially if you can confirm (using nvidia-smi, for example) that the GPU is used more while you train than when you don't.
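If it helps, here is a rough sketch of how you could poll nvidia-smi from Python while training runs in another terminal. The helper names (parse_gpu_stats, read_gpu_stats) are made up for illustration, and it assumes nvidia-smi is on your PATH and reports plain numbers for both fields (on some setups a field can come back as "[N/A]", which this sketch does not handle).

```python
import subprocess

# Ask nvidia-smi for GPU utilization (%) and used memory (MiB) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Turn nvidia-smi CSV output into a list of (utilization, memory) tuples.

    One tuple per GPU, e.g. the line "10, 1024" becomes (10, 1024).
    """
    stats = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (hypothetical helper)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Example use: call read_gpu_stats() once while idle and a few times during
# training, then compare the utilization numbers.
```

If the utilization figure is clearly higher during training than at idle, the GPU is being used and the setup itself is probably fine.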

Note, however, that the GPU is not necessarily the bottleneck in your training: CPU-only work such as data augmentation may be slow enough that the GPU sits idle waiting for input and ends up underutilized.
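One quick way to check this is to time how long each iteration waits for data versus how long the training step itself takes. The sketch below is only an illustration: time_input_vs_step, slow_batches and fast_step are hypothetical stand-ins, and it assumes your training step blocks until it finishes (as a sess.run call does).

```python
import time


def time_input_vs_step(batches, train_step):
    """Return (seconds spent producing batches, seconds spent in train_step)."""
    data_time = 0.0
    step_time = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # CPU-side loading / augmentation happens here
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)           # in real code, e.g. a call that runs the train op
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


def slow_batches(n):
    """Hypothetical input pipeline with expensive CPU preprocessing."""
    for i in range(n):
        time.sleep(0.01)  # stands in for augmentation, tokenizing, etc.
        yield [i] * 8


def fast_step(batch):
    """Hypothetical training step that finishes almost instantly."""
    return sum(batch)


data_s, step_s = time_input_vs_step(slow_batches(5), fast_step)
print("data: %.3fs, step: %.3fs" % (data_s, step_s))
```

If the data time dominates, the input pipeline rather than the GPU is what's limiting you.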

I'd advise profiling your training code to see what's taking all that CPU power.
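As a starting point, Python's built-in cProfile can show which functions eat the CPU time. This is a minimal sketch, not your actual training code: profile_call, augment and train_one_epoch are hypothetical names, and you would pass your own training function instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return a text report of the `top` entries,
    sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def augment(batch):
    """Hypothetical CPU-heavy preprocessing step."""
    return [x * 2 for x in batch]


def train_one_epoch():
    """Hypothetical training loop that spends its time in augment()."""
    for _ in range(200):
        augment(list(range(1000)))


print(profile_call(train_one_epoch))
```

Functions near the top of the report with a large cumulative time are the ones worth looking at first. Note that cProfile only sees Python-level calls, so time spent inside TensorFlow's native code shows up under the Python function that called it.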