This is more of a deep learning conceptual problem, and if this is not the right platform I'll take it elsewhere.
I'm trying to use a Keras LSTM sequential model to learn sequences of text and map them to a numeric value (a regression problem).
The thing is, training always converges too quickly to a high loss (on both the training and test sets). I've tried many hyperparameter combinations, and I have a feeling it's a local-minimum issue that causes the model's high bias.
My questions are basically:
- How to initialize the weights and biases for this problem? (see the sketch after this list)
- Which optimizer to use?
- How deep should I make the network? (I'm afraid that a very deep network will make training time unbearable and increase the model's variance)
- Should I add more training data?
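On the first two questions, Keras exposes both the initializers and the optimizer as ordinary arguments, so they are easy to experiment with. A minimal sketch; the string names below are just the Keras defaults written out explicitly, not a recommendation:

```python
from tensorflow.keras.layers import LSTM

# Keras lets you set each initializer per layer; these strings are
# the library defaults, shown explicitly so they can be swapped out.
lstm = LSTM(
    128,
    kernel_initializer="glorot_uniform",  # input-to-hidden weights
    recurrent_initializer="orthogonal",   # hidden-to-hidden weights
    bias_initializer="zeros",             # biases
)
```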
Input and output are normalized with min-max scaling. For context, this is roughly what that looks like with scikit-learn; the array shapes below are placeholders, not my actual data:
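```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data shaped (samples, timesteps, features).
X = np.random.rand(100, 50, 10)
y = np.random.rand(100, 1)

x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

# MinMaxScaler expects 2-D input, so flatten the time axis and restore it.
X_scaled = x_scaler.fit_transform(X.reshape(-1, X.shape[-1])).reshape(X.shape)
y_scaled = y_scaler.fit_transform(y)
```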
I am using SGD with momentum; currently 3 LSTM layers (126, 256, 128 units) and 2 dense layers (200 units and a single output neuron), as sketched below.
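A minimal sketch of that setup; the input shape, activation, learning rate, and momentum value are placeholders chosen for illustration, not necessarily what I'm running:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import SGD

timesteps, n_features = 50, 10  # placeholder input shape

model = Sequential([
    LSTM(126, return_sequences=True, input_shape=(timesteps, n_features)),
    LSTM(256, return_sequences=True),
    LSTM(128),                     # last LSTM returns only the final state
    Dense(200, activation="relu"),
    Dense(1),                      # linear output for the regression target
])
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss="mse")
```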
I have printed the weights after a few epochs and noticed that many weights are zero and the rest basically have a value of 1 (or very close to it).
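(For reference, this is roughly how I inspected them, continuing the sketch above:)

```python
# Print a summary of every weight array in the trained model.
for layer in model.layers:
    for w in layer.get_weights():
        print(layer.name, w.shape, "min:", w.min(), "max:", w.max())
```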

Try the 'adam' optimizer; it often finds its way automatically. But an answer cannot be given without many tests and details. It seems your learning rate may be too high, but that may not be the only possible cause. - Daniel Möller
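Switching to Adam is a one-line change to the sketch above; the learning rate below is just an example value to try, not one prescribed by the comment:

```python
from tensorflow.keras.optimizers import Adam

# Recompile the model above with Adam and an explicitly lower
# learning rate (the Keras default for Adam is 0.001).
model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse")
```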