I am training neural networks with the Keras library for Python, and I ran into a behaviour I don't understand.
Often, even slightly bigger models converge to a larger error than smaller ones.
Why does this happen? I would expect a bigger model simply to take longer to train, but to converge to the same or a smaller error.
I tuned the hyperparameters, tried different amounts of dropout regularization, and let each model train for a sufficient time. I experimented with models of about 10-20k parameters, 5 layers, 10M data samples, and 20-100 epochs with a decreasing learning rate. The models contained Dense and sometimes LSTM layers.
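For reference, here is a minimal sketch of the kind of setup I mean. The layer widths, input size, and decay schedule below are illustrative placeholders, not my exact model; the point is just comparing a smaller network against a slightly bigger one of the same depth, with dropout and a decaying learning rate:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(width=64, dropout=0.2):
    # 5-layer Dense network with dropout; widths are illustrative.
    model = keras.Sequential([
        layers.Input(shape=(32,)),
        layers.Dense(width, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(width, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(width, activation="relu"),
        layers.Dense(width, activation="relu"),
        layers.Dense(1),
    ])
    # Decreasing learning rate via exponential decay over training steps.
    lr = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.9)
    model.compile(optimizer=keras.optimizers.Adam(lr), loss="mse")
    return model

small = build_model(width=48)  # smaller model
big = build_model(width=64)    # slightly bigger model (~15k parameters)
print(small.count_params(), big.count_params())
```

With these widths, the bigger model lands in the 10-20k parameter range I mentioned, yet in my experiments the analogous larger variants often ended up with a worse final error than their smaller counterparts.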