I'm seeing strange results when I load a saved model that was trained on multiple GPUs and run it as a single-GPU model. I'm working in a shared environment, so I train on 4 GPUs but run tests using a single GPU.
What I'm seeing is that the tests return materially different results when run on (i) a single GPU and (ii) 4 GPUs. For example, here is the validation output for the model that is ultimately selected (I checkpoint the multi-GPU model and use early stopping):
Epoch 9: Sensitivity: 0.8317 - Specificity: 0.9478 - Avg. Sn/Sp: 0.8897 - Acc: 0.9289 - PPV: 0.7555 - NPV: 0.9667 - F1: 0.7918 - ROC AUC: 0.8897 - Matrix: [1016 56 35 173]
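For context, this is roughly how the checkpoint/early-stopping setup looks. The sketch below is a self-contained toy stand-in (the data and architecture are placeholders, not my real ones); in my actual training script the model is additionally wrapped with multi_gpu_model(model, gpus=4) before compiling, and the checkpoint is taken from that parallel model:

```python
# Minimal, self-contained sketch of the checkpoint + early-stopping setup.
# Toy data and a toy architecture stand in for my real ones; in my real
# script the model is wrapped with multi_gpu_model(..., gpus=4) first.
import numpy as np
from tensorflow import keras

np.random.seed(0)
x = np.random.rand(64, 4).astype("float32")
y = (x.sum(axis=1) > 2).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # keep only the best model by validation loss -- this checkpoint file
    # is what I later reload for testing
    keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                    save_best_only=True),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]
model.fit(x, y, validation_split=0.25, epochs=5, verbose=0,
          callbacks=callbacks)
```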
Here is the result when the model is tested against the validation data using 4 GPUs (I load the model on the CPU inside a with tf.device('/cpu:0') block and then call multi_gpu_model):
Metric _base
------------- -------
acc 0.93
auc 0.881
f1 0.804
ppv 0.804
npv 0.958
sensitivity 0.804
specificity 0.958
Confusion matrices [tn, fp, fn, tp]
-----------------------------------
_base : [1017 45 45 185]
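The 4-GPU test path looks roughly like the sketch below (the checkpoint file name is hypothetical, and a tiny model is saved first so the load step is self-contained). Note that multi_gpu_model has since been removed from Keras, so the 2.1.3-era wrapping is shown as comments rather than executed:

```python
# Sketch of the 4-GPU test path: load the checkpoint on the CPU, then wrap it.
# "checkpoint.h5" is a hypothetical file name; a toy model is saved first so
# this snippet runs on its own.
import numpy as np
import tensorflow as tf
from tensorflow import keras

# stand-in checkpoint so the load below has something to read
keras.Sequential([keras.layers.Dense(1, input_shape=(4,))]).save("checkpoint.h5")

# load the saved weights onto the CPU, as in my test script
with tf.device("/cpu:0"):
    base_model = keras.models.load_model("checkpoint.h5")

# keras 2.1.3 (my environment) then wraps and evaluates the parallel model:
# from keras.utils import multi_gpu_model
# parallel_model = multi_gpu_model(base_model, gpus=4)
# parallel_model.compile(...)  # then evaluate against the validation data

preds = base_model.predict(np.zeros((2, 4), dtype="float32"), verbose=0)
```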
And here is the result when running the same test against the same data using only 1 GPU (simply loading the model with load_model); the single-GPU run consistently produces the better classifier:
Metric _base
------------- -------
acc 0.974
auc 0.946
f1 0.92
ppv 0.936
npv 0.982
sensitivity 0.905
specificity 0.988
Confusion matrices [tn, fp, fn, tp]
-----------------------------------
_base : [1069 13 20 190]
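For reference, the scalar metrics above (other than AUC) follow directly from the [tn, fp, fn, tp] vector; this pure-Python check reproduces the single-GPU row:

```python
# Recompute the headline metrics from the single-GPU confusion matrix
# [tn, fp, fn, tp] = [1069, 13, 20, 190] reported above.
tn, fp, fn, tp = 1069, 13, 20, 190

sensitivity = tp / (tp + fn)           # recall / true positive rate
specificity = tn / (tn + fp)           # true negative rate
ppv = tp / (tp + fp)                   # precision
npv = tn / (tn + fn)
acc = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)

print(round(acc, 3), round(f1, 2), round(sensitivity, 3), round(specificity, 3))
# -> 0.974 0.92 0.905 0.988
```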
Software versions: Python 3.5.2, Keras 2.1.3, TensorFlow 1.5.0 (I'm building an environment with current versions before opening an issue)
Hardware: 4 x Tesla P100, CUDA 9.0.176, cuDNN 7
Does anyone have an idea of what's going on and, more importantly, how I can reproduce the effect?