
I am training an encoder-decoder LSTM in Keras for text summarization on the CNN dataset, with the following architecture (a rough sketch of the setup follows the list):

[Picture of bidirectional encoder-decoder LSTM]

  1. I pretrain the word embeddings (of size 256) using skip-gram.

  2. I then pad the input sequences with zeros so that all articles are of equal length.

  3. I prepend a vector of 1's to each summary to act as the "start" token.

  4. I use MSE loss, the RMSProp optimizer, and a tanh activation in the decoder output layer.

  5. Training: 20 epochs, batch_size=100, clip_norm=1, dropout=0.3, hidden_units=256, LR=0.001, training examples=10000, validation_split=0.2.

  6. The network trains, and both training and validation MSE drop to 0.005; however, during inference the decoder keeps producing a repetition of a few words that make no sense and are nowhere near the real summary.
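In code, the setup is roughly the sketch below (simplified: the vocabulary size is a placeholder, the pretrained skip-gram matrix is omitted, and the start-token handling is reduced to a token id):

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional,
                                     Dense, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop

vocab_size, embed_dim, hidden_units = 50000, 256, 256   # vocab size is a placeholder
output_dim = embed_dim   # assumption: targets are embedding-sized summary vectors

# Encoder: padded article token ids -> bidirectional LSTM states
enc_in = Input(shape=(None,), name="article_tokens")
enc_emb = Embedding(vocab_size, embed_dim, mask_zero=True)(enc_in)  # pretrained skip-gram weights go here
_, fh, fc, bh, bc = Bidirectional(LSTM(hidden_units, return_state=True))(enc_emb)
state_h = Concatenate()([fh, bh])    # forward/backward states -> 2 * hidden_units
state_c = Concatenate()([fc, bc])

# Decoder: summary token ids (the "start" token is simplified to an id here)
dec_in = Input(shape=(None,), name="summary_tokens")
dec_emb = Embedding(vocab_size, embed_dim, mask_zero=True)(dec_in)
dec_out, _, _ = LSTM(2 * hidden_units, return_sequences=True,
                     return_state=True)(dec_emb, initial_state=[state_h, state_c])

# tanh output head trained with MSE, as described in point 4
predictions = Dense(output_dim, activation="tanh")(dec_out)

model = Model([enc_in, dec_in], predictions)
model.compile(optimizer=RMSprop(learning_rate=0.001, clipnorm=1.0), loss="mse")
# model.fit([articles, summaries_in], summary_targets,
#           epochs=20, batch_size=100, validation_split=0.2)
```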

My question is: is there anything fundamentally wrong with my training approach, the padding, the loss function, the data size, or the training time that would cause the network to fail to generalize?


1 Answer

  • Your model looks OK, except for the loss function. I can't see how MSE is applicable to word prediction; cross-entropy loss looks like the natural choice here (see the first sketch after this list).

  • The repeated words can be caused by the way the decoder works at inference time: rather than simply selecting the most probable word from the distribution, sample from it. This gives the generated text more variety. From there, start looking at beam search (a minimal sampling sketch follows the list).

  • If I had to pick a single technique to boost sequence-to-sequence model performance, it would certainly be the attention mechanism. There are lots of posts about it; you can start with this one, for example (a rough attention sketch also follows this list).
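On the loss: a minimal sketch of swapping the tanh/MSE output head for a softmax over the vocabulary with cross-entropy, reusing the names from the sketch in the question (`enc_in`, `dec_in`, `dec_out`, `vocab_size`):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop

# One probability distribution over the vocabulary per decoder time step
word_probs = Dense(vocab_size, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], word_probs)
# sparse_categorical_crossentropy keeps the targets as integer word ids,
# so no huge one-hot tensors are needed
model.compile(optimizer=RMSprop(learning_rate=0.001, clipnorm=1.0),
              loss="sparse_categorical_crossentropy")
```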
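On decoding: a small sketch of temperature sampling from the decoder's softmax output at inference time (pure NumPy; the function name is made up):

```python
import numpy as np

def sample_next_word(probs, temperature=1.0):
    """Pick the next word id by sampling instead of taking the argmax.

    probs: 1-D array of vocabulary probabilities for the current step.
    temperature < 1 is closer to greedy, > 1 gives more diverse output.
    """
    logits = np.log(probs + 1e-9) / temperature
    scaled = np.exp(logits - np.max(logits))
    return np.random.choice(len(probs), p=scaled / scaled.sum())

# Greedy decoding would be `np.argmax(probs)`; beam search instead keeps the
# k best-scoring partial summaries at every step rather than committing to one word.
```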
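On attention: a rough, self-contained sketch using the dot-product (Luong-style) `Attention` layer that ships with tf.keras. Sizes are placeholders, and the hand-off of encoder final states to the decoder is omitted for brevity:

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     Attention, Concatenate)
from tensorflow.keras.models import Model

vocab_size, embed_dim, hidden_units = 50000, 256, 256   # placeholders

enc_in = Input(shape=(None,))
enc_seq = LSTM(hidden_units, return_sequences=True)(
    Embedding(vocab_size, embed_dim, mask_zero=True)(enc_in))

dec_in = Input(shape=(None,))
dec_seq = LSTM(hidden_units, return_sequences=True)(
    Embedding(vocab_size, embed_dim, mask_zero=True)(dec_in))

# Each decoder step attends over all encoder steps and receives a weighted
# "context" view of the source article before the output projection
context = Attention()([dec_seq, enc_seq])
merged = Concatenate()([dec_seq, context])
word_probs = Dense(vocab_size, activation="softmax")(merged)

model = Model([enc_in, dec_in], word_probs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```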