I am training an encoder-decoder LSTM in keras for text summarization and the CNN dataset with the following architecture
Picture of bidirectional encoder-decoder LSTM
I am pretraining the word embedding (of size 256) using skip-gram and
I then pad the input sequences with zeros so all articles are of equal length
I put a vector of 1's in each summary to act as the "start" token
Use MSE, RMSProp, tanh activation in the decoder output later
Training: 20 epochs, batch_size=100, clip_norm=1,dropout=0.3, hidden_units=256, LR=0.001, training examples=10000, validation_split=0.2
- The network trains and training and validation MSE go down to 0.005, however during inference, the decoder keeps producing a repetition of a few words that make no sense and are nowhere near the real summary.
My question is, is there anything fundamentally wrong in my training approach, the padding, loss function, data size, training time so that the network fails to generalize?