2
votes

As my inputs are of variable length, I need to pad them all to the same size so I can feed them to a Bidirectional LSTM.

But what difference does pre-padding make compared to post-padding?

For example:

    input    [3, 2, 1, 2]
    pre-pad  [0, 0, 0, 3, 2, 1, 2]
    post-pad [3, 2, 1, 2, 0, 0, 0]

Which variant helps with better gradient flow?
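
For reference, this is how I generate the two variants with Keras' pad_sequences; a minimal sketch, assuming the tf.keras preprocessing API:

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sequences = [[3, 2, 1, 2], [5, 4], [7, 6, 8, 9, 1]]

    # Zeros prepended (this is Keras' default padding mode)
    pre_padded = pad_sequences(sequences, maxlen=7, padding='pre', value=0)

    # Zeros appended
    post_padded = pad_sequences(sequences, maxlen=7, padding='post', value=0)

    print(pre_padded[0])   # [0 0 0 3 2 1 2]
    print(post_padded[0])  # [3 2 1 2 0 0 0]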

1
I don't know why this question has been downvoted. It's a good question. Maybe you should specifically point out that you're talking about Keras' Bidirectional layer. – z0r
This issue on GitHub asks a similar question but it doesn't have an answer. – z0r

1 Answer

3
votes

Usually a recurrent network places higher emphasis on the information it has seen last. Therefore, whether you should use pre- or post-padding depends highly on your data and problem.

Consider the following example: you have an encoder-decoder architecture. The encoder reads the data and outputs some fixed-dimensional representation, while the decoder should do the reverse. For the encoder it would make sense to pre-pad the input, so that it doesn't just read padding at the end while forgetting the actual meaningful content it has seen before. For the decoder, on the other hand, post-padding might be better, as it should probably learn to produce some end-of-sequence token at the end and ignore the padding that follows anyway.
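
As a rough sketch of that encoder case (my own illustration with made-up layer sizes, assuming tf.keras), pre-padding keeps the meaningful tokens closest to the final state that becomes the fixed-size representation:

    from tensorflow.keras import layers, models
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sequences = [[3, 2, 1, 2], [5, 4, 8]]
    vocab_size, max_len = 1000, 7          # made-up sizes, just for illustration

    # Pre-pad: the real tokens are read last, right before the summary is formed
    enc_input = pad_sequences(sequences, maxlen=max_len, padding='pre')

    encoder = models.Sequential([
        layers.Embedding(vocab_size, 32),
        layers.LSTM(64),                   # final hidden state = fixed-size summary
    ])

    summary = encoder(enc_input)
    print(summary.shape)                   # (2, 64)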

What is better suited for a Bidirectional LSTM is hard to say and might also depend on the actual problem in the end. In the simplest case, it shouldn't really matter.
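
For the Bidirectional layer specifically, a minimal sketch (again with made-up sizes) where you can simply try both padding variants against each other on your data:

    from tensorflow.keras import layers, models
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sequences = [[3, 2, 1, 2], [5, 4, 8]]
    max_len = 7

    model = models.Sequential([
        layers.Embedding(1000, 32),
        layers.Bidirectional(layers.LSTM(64)),   # reads the sequence in both directions
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

    # Compare validation performance of the two variants on your own task
    x_pre = pad_sequences(sequences, maxlen=max_len, padding='pre')
    x_post = pad_sequences(sequences, maxlen=max_len, padding='post')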