I'm seeing strange results when I load a saved model that was trained on multiple GPUs and run it as a single-GPU model. I'm working in a shared environment, so I train on 4 GPUs but run tests using a single GPU.
What I'm seeing is that the tests return materially different results when run on (i) a single GPU and (ii) 4 GPUs. For example, here is the validation output for the model that is ultimately selected (I checkpoint the multi-GPU model and use early stopping):
Epoch 9: Sensitivity: 0.8317 - Specificity: 0.9478 - Avg. Sn/Sp: 0.8897 - Acc: 0.9289 - PPV: 0.7555 - NPV: 0.9667 - F1: 0.7918 - ROC AUC: 0.8897 - Matrix: [1016 56 35 173]
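For context, this is roughly how the checkpoint/early-stopping setup looks. The sketch below is a self-contained toy stand-in (the data and architecture are placeholders, not my real ones); in my actual training script the model is additionally wrapped with multi_gpu_model(model, gpus=4) before compiling, and the checkpoint is taken from that parallel model:

```python
# Minimal, self-contained sketch of the checkpoint + early-stopping setup.
# Toy data and a toy architecture stand in for my real ones; in my real
# script the model is wrapped with multi_gpu_model(..., gpus=4) first.
import numpy as np
from tensorflow import keras

np.random.seed(0)
x = np.random.rand(64, 4).astype("float32")
y = (x.sum(axis=1) > 2).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # keep only the best model by validation loss -- this checkpoint file
    # is what I later reload for testing
    keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                    save_best_only=True),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]
model.fit(x, y, validation_split=0.25, epochs=5, verbose=0,
          callbacks=callbacks)
```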
Here is the result when the model is tested against the validation data using 4 GPUs (I load the model on the CPU inside a with tf.device('/cpu:0') block and then call multi_gpu_model):
Metric _base
------------- -------
acc 0.93
auc 0.881
f1 0.804
ppv 0.804
npv 0.958
sensitivity 0.804
specificity 0.958
Confusion matrices [tn, fp, fn, tp]
-----------------------------------
_base : [1017 45 45 185]
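The 4-GPU test path looks roughly like the sketch below (the checkpoint file name is hypothetical, and a tiny model is saved first so the load step is self-contained). Note that multi_gpu_model has since been removed from Keras, so the 2.1.3-era wrapping is shown as comments rather than executed:

```python
# Sketch of the 4-GPU test path: load the checkpoint on the CPU, then wrap it.
# "checkpoint.h5" is a hypothetical file name; a toy model is saved first so
# this snippet runs on its own.
import numpy as np
import tensorflow as tf
from tensorflow import keras

# stand-in checkpoint so the load below has something to read
keras.Sequential([keras.layers.Dense(1, input_shape=(4,))]).save("checkpoint.h5")

# load the saved weights onto the CPU, as in my test script
with tf.device("/cpu:0"):
    base_model = keras.models.load_model("checkpoint.h5")

# keras 2.1.3 (my environment) then wraps and evaluates the parallel model:
# from keras.utils import multi_gpu_model
# parallel_model = multi_gpu_model(base_model, gpus=4)
# parallel_model.compile(...)  # then evaluate against the validation data

preds = base_model.predict(np.zeros((2, 4), dtype="float32"), verbose=0)
```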
And here is the result when running the same test against the same data using only 1 GPU (simply loading the model with load_model); the single-GPU run consistently produces the better classifier:
Metric _base
------------- -------
acc 0.974
auc 0.946
f1 0.92
ppv 0.936
npv 0.982
sensitivity 0.905
specificity 0.988
Confusion matrices [tn, fp, fn, tp]
-----------------------------------
_base : [1069 13 20 190]
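For reference, the scalar metrics above (other than AUC) follow directly from the [tn, fp, fn, tp] vector; this pure-Python check reproduces the single-GPU row:

```python
# Recompute the headline metrics from the single-GPU confusion matrix
# [tn, fp, fn, tp] = [1069, 13, 20, 190] reported above.
tn, fp, fn, tp = 1069, 13, 20, 190

sensitivity = tp / (tp + fn)           # recall / true positive rate
specificity = tn / (tn + fp)           # true negative rate
ppv = tp / (tp + fp)                   # precision
npv = tn / (tn + fn)
acc = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)

print(round(acc, 3), round(f1, 2), round(sensitivity, 3), round(specificity, 3))
# -> 0.974 0.92 0.905 0.988
```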
Software versions: Python 3.5.2, Keras 2.1.3, TensorFlow 1.5.0 (I'm building an environment with current versions before opening an issue)
Hardware: 4 x Tesla P100, CUDA 9.0.176, cuDNN 7
Does anyone have an idea of what's going on and, more importantly, how I can reproduce the effect?