Differences in encoder - decoder models between Keras and Pytorch

Question

There seem to be significant, fundamental differences in construction of encoder-decoder models between keras and pytorch. Here is keras' enc-dec blog and here is pytorch's enc-dec blog.

Some differences I noticed are the following:

Keras' model directly feeds input to LSTM layer. Whereas Pytorch uses an embedding layer for both the encoder and decoder.
Pytorch uses an embedding layer with no activation in the encoder but uses relu activation for the embedding layer in the decoder.

Given these observations, my questions are the following:

My understanding is the following, is it correct? The embedding layer is not strictly required but it helps in finding a better and denser representation of the input. It is optional and you can still build a good model without the embedding layer (dependent on the problem). This is why Keras chose not to use it in this particular example. Is this a sound reason or is there more to the story?
Why use an activation for the embedding layer in the decoder but not the encoder?
Why use 'relu' as the activation instead of 'tanh', etc for the embedding layer? What's the intuition here? I've only seen 'relu' applied to data that has spatial relation, not temporal relation.

Wasi Ahmad Wasi Ahmad · Accepted Answer · 2020-06-09T10:13:54

You have a wrong understanding of encoder-decoder models. First of all, please note Keras and Pytorch are two deep learning frameworks, while encoder-decoder is a type of neural network architecture. So, you need to understand how encoder-decoder works in the first place and then revise their architecture as per your need. Now, let me come back to your questions.

Embedding layer converts one-hot encoding representations into low-dimensional vector representations. For example, we have a sentence I love programming. We want to translate this sentence into German using an encoder-decoder network. So, the first step is to first convert the words in the input sentence into a sequence of vector representations, and this can be done using an embedding layer. Please note, the use of Keras or Pytorch doesn't matter. You can think, how would you give a natural language sentence as input to an LSTM? Obviously, you first need to convert them into vectors.
There is no such rule that you should use an activation layer in the embedding layer for the decoder, but not in the encoder. Remember, activation functions are non-linear functions. So, applying a non-linearity has different consequences but it has nothing to do with the encoder-decoder framework.
Again, the choice of activation function depends on other factors, not on encoder or decoder or a specific type of neural network architecture. I suggest you read the characteristics of the popular activation functions that are used in neural networks. Also, do not come into conclusions after observing a few use cases. Such conclusions are dangerous.

Differences in encoder - decoder models between Keras and Pytorch

1 Answers