
I am trying to implement an End-to-End Memory Network in PyTorch, using the bAbI dataset. The network architecture is:

MemN2N (
(embedding_A): Embedding(85, 120, padding_idx=0)
(embedding_B): Embedding(85, 120, padding_idx=0)
(embedding_C): Embedding(85, 120, padding_idx=0)
(match): Softmax ()
)
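
(For context, a single memory hop with these three embeddings corresponds roughly to the sketch below. This is a simplified, single-hop version rather than my exact code; in particular, the final projection that reuses the weights of embedding_C is just one common choice.)

import torch
import torch.nn as nn

class MemN2N(nn.Module):
    """Single-hop End-to-End Memory Network (simplified sketch)."""
    def __init__(self, vocab_size=85, embed_dim=120):
        super().__init__()
        # A embeds story sentences into memory vectors, B embeds the question
        # into the controller state, C embeds the sentences into output vectors.
        self.embedding_A = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.embedding_B = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.embedding_C = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.match = nn.Softmax(dim=-1)

    def forward(self, story, query):
        # story: (num_sentences, sentence_len), query: (query_len,)
        m = self.embedding_A(story).sum(dim=1)   # memory vectors, (num_sentences, embed_dim)
        c = self.embedding_C(story).sum(dim=1)   # output vectors, (num_sentences, embed_dim)
        u = self.embedding_B(query).sum(dim=0)   # question vector, (embed_dim,)
        p = self.match(m @ u)                    # attention weights over memories
        o = p @ c                                # weighted sum of output vectors
        # Score every vocabulary word as the answer (weight tying with C;
        # the paper uses a separate output matrix W).
        logits = (o + u) @ self.embedding_C.weight.t()
        return logits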

Here, 85 is the vocabulary size and 120 the embedding dimension. The loss function is cross-entropy and the optimizer is RMSprop. The results are:

Epoch    Train Loss    Test Loss    Train Acc    Test Acc
10       0.608         11.213       1.0          0.99
20       0.027         11.193       1.0          0.99
30       0.0017        11.740       1.0          0.99
40       0.0006        12.190       1.0          0.99
50       5.597e-05     12.319       1.0          0.99
60       3.366e-05     12.379       1.0          0.99
70       2.72e-05      12.361       1.0          0.99
80       2.64e-05      12.333       1.0          0.99
90       2.63e-05      12.329       1.0          0.99
100      2.63e-05      12.329       1.0          0.99
110      2.63e-05      12.329       1.0          0.99
120      2.63e-05      12.329       1.0          0.99

Final: Train Acc 1.0, Test Acc 0.999

I know the accuracy is good, but I am puzzled by the behaviour of the test loss: while the training loss decreases, the test loss keeps increasing, even though both losses are computed in exactly the same way. Shouldn't the test loss decrease too? The numbers above are for Task 1, but the behaviour is the same on the other tasks.

Do you have any idea about this behaviour?

1 Answer

When the training loss keeps decreasing but the test loss starts to increase, that is the moment you begin to overfit: the network weights fit the training data better and better, but the extra fit does not generalize to new, unseen data. That is the point at which you should stop training.
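
In practice you can automate this by tracking the test/validation loss and keeping the best checkpoint. A minimal sketch, assuming you already have a model, data loaders, an optimizer, and train_one_epoch/evaluate helpers (hypothetical names):

import copy

best_loss, best_state = float("inf"), None
patience, bad_epochs = 10, 0

for epoch in range(120):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper

    if val_loss < best_loss:
        best_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())  # remember the best weights
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # no improvement for `patience` epochs
            break

model.load_state_dict(best_state)   # roll back to the best checkpoint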

You are embedding 85 words in 120 dimensions, so there is no information bottleneck at all; 120 dimensions is far too many for only 85 words. With that many free parameters you can fit anything, even noise. Try replacing 120 with 10 and you will probably not overfit any more; if you go as low as 2 dimensions, you will probably underfit instead.
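
A rough count of the free parameters in just the three embedding matrices makes the point (this ignores any other parameters your model may have):

def embedding_params(vocab_size, embed_dim):
    # three embedding matrices (A, B and C), each of size vocab_size x embed_dim
    return 3 * vocab_size * embed_dim

print(embedding_params(85, 120))  # 30600 free parameters
print(embedding_params(85, 10))   #  2550 free parameters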

Overfitting: your model has enough capacity to fit particularities of the training data that do not generalize to new data from the same distribution.

Underfitting: your model does not have enough capacity to fit even the training data (you cannot bring the training loss "close" to zero).

In your case, I am guessing that your model becomes over-confident on the training data (output probabilities very close to 0 or 1), which is justified for the training examples but far too confident for the test data (or any other data it was not trained on).
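
You can see why over-confidence inflates the test loss even at 99% accuracy by looking at the cross-entropy of a single answer as a function of the probability assigned to the correct word:

import torch

# Cross-entropy for one example is -log(p), where p is the probability the
# model assigns to the correct answer. A handful of confidently wrong answers
# can dominate the average test loss even when 99% of answers are correct.
for p in [0.9, 0.5, 1e-3, 1e-6]:
    print(p, -torch.log(torch.tensor(p)).item())
# 0.9   -> ~0.105
# 0.5   -> ~0.693
# 0.001 -> ~6.91
# 1e-06 -> ~13.8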