1
votes

I recently got introduced to the magical world of neural networks. I started following the book Neural Networks and Deep Learning, which implements a NN to recognize handwritten digits. It builds a 3-layer network (1 input, 1 hidden, and 1 output layer) and trains it on the MNIST data set.
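For reference, this is roughly the setup I am using, along the lines of the book's code (the mnist_loader and network modules from the book's repository; exact names may differ in your copy):

    import mnist_loader
    import network

    # Load MNIST and build the [784, 30, 10] network from the book.
    training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
    net = network.Network([784, 30, 10])

    # 30 epochs, mini-batch size 10, learning rate (eta) 3.0.
    net.SGD(training_data, 30, 10, 3.0, test_data=test_data)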

I just found that the weight matrices of two NNs with the same [784, 30, 10] architecture, trained on the same data set, are very different. The same is true for the bias matrices.

General intuition says that since we are using multiple epochs and randomizing the data at each epoch, the weight matrices of both NNs should converge to similar values. But they turn out to be very different. What could be the reason for this?

Here are the first few weights of NN1:

[array([[-1.2129184 , -0.08418661, -1.58413842, ...,  0.14350188,
          1.49436597, -1.71864906],
        [ 0.25485346, -0.1795214 ,  0.14175609, ...,  0.4222159 ,
          1.28005992, -1.17403326],
        [ 1.09796094,  0.66119858,  1.12603969, ...,  0.23220572,
         -1.66863656,  0.02761243],.....

Here are the first few weights of NN2, which has the same number of layers and was trained with the same training data, number of epochs, and eta:

[array([[-0.87264811,  0.34475347, -0.04876076, ..., -0.074056  ,
          0.10218085, -0.50177084],
        [-1.96657944, -0.35619652,  1.10898861, ..., -0.53325862,
         -1.52680967,  0.26800431],
        [-1.24731848,  0.13278103, -1.70306514, ...,  0.07964225,
         -0.88724451, -0.40311485],
        ...,
2
There is a lot of randomness involved here. Two important types: 1) how weights are initialized before training and 2) how the training examples are shuffled. Certain kinds of randomness can be fixed by setting random seeds, depending on the framework / libraries you are using, but some operations might still be non-deterministic. Bottom line: never assume that training twice leads to the same results. – Mathias Müller
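For example, a minimal sketch of fixing those seeds with the book's numpy-based code (assuming the Network class from the book's network.py; adjust the import to your setup):

    import random
    import numpy as np
    import network  # the book's network.py

    def make_net(seed):
        random.seed(seed)     # controls random.shuffle of the training data in SGD
        np.random.seed(seed)  # controls the np.random.randn weight initialization
        return network.Network([784, 30, 10])

    net1 = make_net(42)
    net2 = make_net(42)

    # Same seed -> identical initial weights; different (or no) seeds -> different weights.
    print(all(np.array_equal(w1, w2) for w1, w2 in zip(net1.weights, net2.weights)))

Even with identical seeds, the runs can still drift apart if any other non-deterministic operation is involved, as noted above.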

2 Answers

1
votes

General intuition says that since we are using multiple epochs and randomizing the data at each epoch, the weight matrices of both NNs should converge to similar values

Unfortunately this is not true. This is because the loss landscape of neural networks is very complex, with lots of local minima that generalize quite well. Because of the random nature of the initialization and the training procedure, you are essentially guaranteed to converge to a different set of parameters each time, each with good performance.
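Here is a rough toy illustration of that point (my own sketch, not the code from the question): a tiny numpy MLP trained on XOR from two different random initializations typically reaches a low loss both times, yet with clearly different weights.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_xor(seed, steps=20000, lr=1.0):
        """Tiny 2-8-1 sigmoid network trained on XOR with full-batch gradient descent."""
        rng = np.random.default_rng(seed)
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        y = np.array([[0], [1], [1], [0]], dtype=float)
        W1, b1 = rng.standard_normal((2, 8)), np.zeros((1, 8))
        W2, b2 = rng.standard_normal((8, 1)), np.zeros((1, 1))
        for _ in range(steps):
            h = sigmoid(X @ W1 + b1)        # hidden activations
            out = sigmoid(h @ W2 + b2)      # network output
            # backprop of the mean squared error through the sigmoids
            d_out = (out - y) * out * (1 - out)
            d_h = (d_out @ W2.T) * h * (1 - h)
            W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
            W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0, keepdims=True)
        return W1, float(np.mean((out - y) ** 2))

    W_a, loss_a = train_xor(seed=0)
    W_b, loss_b = train_xor(seed=1)
    print(loss_a, loss_b)            # typically both small: both runs fit XOR
    print(np.allclose(W_a, W_b))     # False: the learned weights are different

Both runs end up in a minimum that solves the task, but nothing forces them to end up in the same one.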

Also note that randomness alone is not sufficient to produce different results. For example, linear regression will always converge to the same parameters regardless of the initial values and the order of the examples. Convergence to the same parameters is guaranteed only for convex loss functions.
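For contrast, a rough sketch of the convex case (again my own illustration): gradient descent on a least-squares linear regression started from two different random points ends up at essentially the same parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))                 # synthetic inputs
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.standard_normal(200)   # noisy targets

    def fit(w0, lr=0.05, steps=2000):
        """Plain gradient descent on the (convex) mean squared error."""
        w = w0.copy()
        for _ in range(steps):
            grad = 2.0 / len(y) * X.T @ (X @ w - y)
            w -= lr * grad
        return w

    w_a = fit(rng.standard_normal(3))        # two very different starting points...
    w_b = fit(rng.standard_normal(3))
    print(np.allclose(w_a, w_b, atol=1e-6))  # ...land on the same solution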

0
votes

The weights of the two NNs definitely won't be the same, barring a major coincidence.

This is because the initial weights that you assign are random, and also, as you mentioned in your question, 'the data at each epoch is randomly chosen'.
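To make both sources of randomness concrete, here is a small sketch in the same numpy style as the book's code (not the book's code itself):

    import random
    import numpy as np

    # Source 1: weight initialization. Two "identical" [784, 30, 10] networks
    # draw their initial weights independently, so they start in different places.
    w1 = [np.random.randn(30, 784), np.random.randn(10, 30)]
    w2 = [np.random.randn(30, 784), np.random.randn(10, 30)]
    print(any(not np.array_equal(a, b) for a, b in zip(w1, w2)))   # True

    # Source 2: data order. Shuffling the training data each epoch gives each
    # run a different sequence of mini-batches, and therefore different updates.
    order1 = list(range(10)); random.shuffle(order1)
    order2 = list(range(10)); random.shuffle(order2)
    print(order1, order2)   # almost certainly two different orderings

Starting from different weights and following different update sequences, the two networks have no reason to end up with the same parameters.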