TensorFlow not using GPU for one dataset, although it does for a very similar dataset

Question

I'm using TensorFlow to train a model on data originating from two sources. For both sources, the shapes of the training and validation data are almost identical, and the dtypes throughout are np.float32.

The strange thing is that when I use the first data set the GPU on my machine is used, but when I use the second data set it is not.

Does anyone have some suggestions on how to investigate?

print(s1_train_data.shape)
print(s1_train_data.values)
(1165032, 941)
[[ 0.45031181 -0.99680316  0.63686389 ...,  0.22323072 -0.37929842  0.        ]
 [-0.40660214  0.34022757 -0.00710014 ..., -1.43051076 -0.14785887  1.        ]
 [ 0.03955967 -0.91227823  0.37887612 ...,  0.16451506 -1.02560401  0.        ]
 ..., 
 [ 0.11746094 -0.18229018  0.43319091 ...,  0.36532226 -0.48208624  0.        ]
 [ 0.110379   -1.07364404  0.42837444 ...,  0.74732345  0.92880726  0.        ]
 [-0.81027234 -1.04290771 -0.56407243 ...,  0.25084609 -0.1797282   1.        ]]

print(s2_train_data.shape)
print(s2_train_data.values)
(559873, 941)
[[ 0.          0.          0.         ..., -1.02008295  0.27371082  0.        ]
 [ 0.          0.          0.         ..., -0.74775815  0.18743835  0.        ]
 [ 0.          0.          0.         ...,  0.6469788   0.67864949  1.        ]
 ..., 
 [ 0.          0.          0.         ..., -0.88198501 -0.02421325  1.        ]
 [ 0.          0.          0.         ...,  0.28361112 -1.08478808  1.        ]
 [ 0.          0.          0.         ...,  0.22360609  0.50698668  0.        ]]

Edit: Here's a snippet of the log with log_device_placement=True.

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x7578380
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:04.0
Total memory: 4.00GiB
Free memory: 3.95GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x7c54b10
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:05.0
Total memory: 4.00GiB
Free memory: 3.95GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x65bb1d0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:06.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y N N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   N Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   N N Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   N N N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GRID K520, pci bus id: 0000:00:04.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GRID K520, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GRID K520, pci bus id: 0000:00:06.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: GRID K520, pci bus id: 0000:00:04.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: GRID K520, pci bus id: 0000:00:05.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: GRID K520, pci bus id: 0000:00:06.0
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: GRID K520, pci bus id: 0000:00:04.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: GRID K520, pci bus id: 0000:00:05.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: GRID K520, pci bus id: 0000:00:06.0

WARNING:tensorflow:From tf.py:183 in get_session.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
gradients_3/add_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/add_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/add_2_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/add_2_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/Mean_1_grad/Tile_grad/range: (Range): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/Mean_1_grad/Tile_grad/range: (Range)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/Mean_1_grad/truediv_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/Mean_1_grad/truediv_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/Size: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/Size: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/range: (Range): /job:localhost/replica:0/task:0/gpu:0

TensorFlow does seem to be placing the ops on the GPU; however, I still see almost entirely 0% GPU-Util in the nvidia-smi monitor.

The pandas dataframe is of course in memory. Is there any other IO that could be impacting this process?

Edit 2: I captured the log_device_placement logs for both the fast and slow data sets. They are identical, even though in one case the GPU usage is 25%, and the other 0%. Really scratching my head now....

Could you turn on log_device_placement and figure out which op placements are different? — Allen Lavoie
Thanks Allen, I'll try this out as soon as I'm back in the office tomorrow — MarkNS

MarkNS · Accepted Answer · 2017-02-10T15:53:14

The cause of the slowness was the memory layout of the ndarray backing the DataFrame. The s2 data was column-major (Fortran order), meaning that the features and target of each row were not contiguous in memory.

This operation changes the memory layout:

s2_train_data = s2_train_data.values.copy(order='C')

and now the GPU is running at 26% utilisation. Happy days :)
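
For anyone who wants to check whether they are hitting the same thing, here is a minimal sketch of how the layout can be inspected and fixed with plain NumPy. The small array below is only a stand-in for s2_train_data.values, and the helper names describe_layout and to_row_major are made up for this example. Whether a given DataFrame's backing array is column-major depends on how the frame was built, so treat this as a diagnostic, not a guarantee.

```python
import numpy as np


def describe_layout(arr):
    """Return 'C' (row-major), 'F' (column-major) or 'neither' for an ndarray."""
    if arr.flags['C_CONTIGUOUS']:
        return 'C'
    if arr.flags['F_CONTIGUOUS']:
        return 'F'
    return 'neither'


def to_row_major(arr):
    """Return a C-contiguous version of arr (copies only if needed)."""
    return np.ascontiguousarray(arr)


# Stand-in for s2_train_data.values: a small float32 array stored column-major.
s2_like = np.asfortranarray(np.arange(12, dtype=np.float32).reshape(3, 4))
fixed = to_row_major(s2_like)

print(describe_layout(s2_like), s2_like.strides)  # column-major: rows are strided
print(describe_layout(fixed), fixed.strides)      # row-major: each row is contiguous
```

In the column-major case the stride along the second axis is larger than the item size, so reading one row (one training example) jumps around in memory. After the conversion the values are unchanged and only the layout differs.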
