I have written a Variational Autoencoder (VAE) in Keras with TensorFlow as the backend. As the optimizer I use Adam, with a learning rate of 1e-4 and a batch size of 16. When I train the net on my MacBook's CPU (Intel Core i7), the loss after one epoch (~5000 minibatches) is a factor of 2 smaller than the loss after the first epoch on a different machine running Ubuntu. On the other machine I get the same result on both its CPU and its GPU (an Intel Xeon E5-1630 and an Nvidia GeForce GTX 1080).

Python and the libraries I'm using have the same versions on both machines, and both machines use 32-bit floating point. If I switch to a different optimizer (e.g., RMSprop), the large difference between the machines is still there. I set np.random.seed to eliminate randomness.
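For concreteness, the run is configured roughly like this (a minimal sketch; the seed value is a placeholder, and the model itself is sketched further below):

```python
import numpy as np
np.random.seed(0)  # placeholder seed value; this is the only RNG I seed explicitly

from keras.optimizers import Adam

optimizer = Adam(lr=1e-4)   # same optimizer settings on both machines
batch_size = 16             # ~5000 minibatches per epoch with my data
```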
My net outputs logits (the output layer has a linear activation), and the loss function is tf.nn.sigmoid_cross_entropy_with_logits. On top of that, one layer has a regularizer: the KL divergence between its activations, which are the parameters of a Gaussian distribution, and a zero-mean Gaussian.
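To make the setup concrete, here is a simplified sketch of the model and loss. The layer sizes and variable names are placeholders rather than my actual code, and in the real model the KL term is attached to the latent layer as a regularizer instead of being written into the loss function, but it computes the same quantity:

```python
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras.optimizers import Adam

original_dim, intermediate_dim, latent_dim = 784, 256, 2  # placeholder sizes

# Encoder: maps the input to the parameters of a Gaussian q(z|x)
x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean = Dense(latent_dim)(h)
z_log_var = Dense(latent_dim)(h)

def sampling(args):
    z_mean, z_log_var = args
    eps = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim))
    return z_mean + K.exp(0.5 * z_log_var) * eps

z = Lambda(sampling)([z_mean, z_log_var])

# Decoder: outputs raw logits (linear activation in the output layer)
h_dec = Dense(intermediate_dim, activation='relu')(z)
logits = Dense(original_dim, activation='linear')(h_dec)

def vae_loss(x_true, x_logits):
    # Reconstruction term computed on the logits, summed over dimensions
    recon = K.sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=x_true, logits=x_logits),
        axis=-1)
    # KL divergence between q(z|x) and a zero-mean, unit-variance Gaussian
    kl = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var),
                      axis=-1)
    return recon + kl

vae = Model(x, logits)
vae.compile(optimizer=Adam(lr=1e-4), loss=vae_loss)
```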
What could be causing such a large difference in the loss value between the two machines?