I am using multiple GPUs (num_gpus = 4) to train one model with multiple towers. The model trains fine on one set of GPUs (CUDA_VISIBLE_DEVICES=0,1,2,3), but runs out of memory (OOM) during the first graph evaluation on another set (CUDA_VISIBLE_DEVICES=0,1,4,5).
Does anyone have an idea why this is happening?
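For context, the towers are placed on the GPUs roughly like this (a minimal sketch only; build_tower, the input shapes, and the optimizer are stand-ins, not my actual code):

import numpy as np
import tensorflow as tf

NUM_GPUS = 4
BATCH_PER_TOWER = 3  # per-tower batch size

def build_tower(x):
    # Stand-in for the real per-GPU model: one dense layer and a scalar loss.
    w = tf.get_variable('w', [784, 10])
    logits = tf.matmul(x, w)
    labels = tf.one_hot(tf.zeros([tf.shape(x)[0]], tf.int32), 10)
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

x = tf.placeholder(tf.float32, [NUM_GPUS * BATCH_PER_TOWER, 784])
tower_inputs = tf.split(x, NUM_GPUS, axis=0)  # one slice of the batch per tower

tower_losses = []
for i in range(NUM_GPUS):
    # /gpu:i maps onto the i-th entry of CUDA_VISIBLE_DEVICES.
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('model', reuse=(i > 0)):  # share weights across towers
            tower_losses.append(build_tower(tower_inputs[i]))

total_loss = tf.add_n(tower_losses) / NUM_GPUS
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)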
The following options are used when creating the session:
session_config = tf.ConfigProto(
    allow_soft_placement=True,   # fall back to CPU if an op has no GPU kernel
    log_device_placement=False)
session_config.gpu_options.per_process_gpu_memory_fraction = 0.94  # claim 94% of each GPU
session_config.gpu_options.allow_growth = False  # allocate the full fraction up front
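The session itself is then created along these lines (again just a sketch; total_loss, train_op, and x refer to the placeholder graph above):

sess = tf.Session(config=session_config)
sess.run(tf.global_variables_initializer())

# The OOM warnings shown below appear on the very first run call
# when CUDA_VISIBLE_DEVICES=0,1,4,5 is used.
batch = np.zeros((NUM_GPUS * BATCH_PER_TOWER, 784), dtype=np.float32)
loss_value, _ = sess.run([total_loss, train_op], feed_dict={x: batch})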
The batch size is already very small: 3.
System information
TensorFlow 1.0
CUDA 8.0
Ubuntu 14.04.5 LTS
All GPUs: GeForce GTX 1080
Logs
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:07:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xcc4593a0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:08:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd2404670
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:18:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd25591b0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:1c:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:18:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:1c:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 47441 get requests, put_count=8461 evicted_count=1000 eviction_rate=0.118189 and unsatisfied allocation rate=0.844839
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.98GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.98GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.68GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.86GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2698 get requests, put_count=8709 evicted_c