I am training neural networks with the Keras library for Python, and I ran into a behaviour I don't understand.
Often, even slightly bigger models converge to a larger error than smaller ones.
Why does this happen? I would expect a bigger model simply to take longer to train, but to converge to the same or a smaller error.
I tuned the hyperparameters, tried different amounts of dropout regularization, and let each model train for a sufficient time. I experimented with models of about 10-20k parameters, 5 layers, 10M data samples, and 20-100 epochs with a decreasing learning rate. The models contained Dense and sometimes LSTM layers.
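For reference, here is a minimal sketch of the kind of setup I mean. The layer widths, input size, and decay schedule below are illustrative placeholders, not my exact model; the point is just comparing a smaller network against a slightly bigger one of the same depth, with dropout and a decaying learning rate:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(width=64, dropout=0.2):
    # 5-layer Dense network with dropout; widths are illustrative.
    model = keras.Sequential([
        layers.Input(shape=(32,)),
        layers.Dense(width, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(width, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(width, activation="relu"),
        layers.Dense(width, activation="relu"),
        layers.Dense(1),
    ])
    # Decreasing learning rate via exponential decay over training steps.
    lr = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.9)
    model.compile(optimizer=keras.optimizers.Adam(lr), loss="mse")
    return model

small = build_model(width=48)  # smaller model
big = build_model(width=64)    # slightly bigger model (~15k parameters)
print(small.count_params(), big.count_params())
```

With these widths, the bigger model lands in the 10-20k parameter range I mentioned, yet in my experiments the analogous larger variants often ended up with a worse final error than their smaller counterparts.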