There seem to be significant, fundamental differences in construction of encoder-decoder models between keras and pytorch. Here is keras' enc-dec blog and here is pytorch's enc-dec blog.
Some differences I noticed are the following:
- Keras' model directly feeds input to LSTM layer. Whereas Pytorch uses an embedding layer for both the encoder and decoder.
- Pytorch uses an embedding layer with no activation in the encoder but uses relu activation for the embedding layer in the decoder.
Given these observations, my questions are the following:
- My understanding is the following, is it correct? The embedding layer is not strictly required but it helps in finding a better and denser representation of the input. It is optional and you can still build a good model without the embedding layer (dependent on the problem). This is why Keras chose not to use it in this particular example. Is this a sound reason or is there more to the story?
- Why use an activation for the embedding layer in the decoder but not the encoder?
- Why use 'relu' as the activation instead of 'tanh', etc for the embedding layer? What's the intuition here? I've only seen 'relu' applied to data that has spatial relation, not temporal relation.