2 votes

I am trying to follow the OpenAI "Sentiment Neuron" experiment by reading through the PyTorch code posted on GitHub for training the model from scratch.

One thing I am not understanding is the byte-level embedding used in the code. I understood that the LSTM outputs a probability distribution for the value of the next byte and I assumed the "embedding" would just be a one-hot encoding of the byte value.

Looking at the code, I see that the input bytes go through a (trainable) dense embedding layer before being fed to the LSTM. Confusingly, the loss is then computed between the model's output and the upcoming byte value, which is not embedded.
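
For reference, here is roughly how I picture the setup (a toy sketch with made-up layer sizes and class names, not the actual repo code):

```python
import torch
import torch.nn as nn

# My rough mental model: bytes (0-255) -> trainable dense embedding -> LSTM -> 256 logits
class ByteLSTM(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(256, embed_dim)   # dense, trainable byte embedding
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 256)       # logits over the next byte

    def forward(self, byte_ids, state=None):
        x = self.embed(byte_ids)                    # (batch, seq) -> (batch, seq, embed_dim)
        h, state = self.lstm(x, state)
        return self.out(h), state                   # (batch, seq, 256) logits

# The loss is taken against the raw (un-embedded) next-byte values:
model = ByteLSTM()
inputs = torch.randint(0, 256, (8, 128))            # current bytes
targets = torch.randint(0, 256, (8, 128))            # next bytes, as plain integers
logits, _ = model(inputs)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 256), targets.reshape(-1))
```
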

My questions are:
1. How is the cross entropy loss computed? Does nn.CrossEntropyLoss take the softmax of its input and expand the target into a one-hot vector "under the hood"?

2. If we want to generate byte strings from this LSTM, how do we embed the output to feed back into the model for the next step? Do we embed the highest-likelihood byte, or take a softmax of the output and use some sort of weighted embedding?

I'm new to LSTMs and I'm trying to learn, but I just don't get it! I appreciate any help!


1 Answer

1 vote

Even though the same symbols (bytes) are used for input and output, it's perfectly acceptable to use a different representation at each end: a dense embedding on the way in, and a distribution over the 256 possible byte values on the way out. Cross entropy is a function of two probability distributions; here, the two distributions are the softmax distribution produced by the model and a point mass on the "correct" next byte.

For question 1, yes: conceptually that is exactly what happens. nn.CrossEntropyLoss applies a (log-)softmax to the logits internally and treats the integer target as the index of the "hot" class, so you never build the one-hot vector or apply a softmax yourself. (The implementation combines LogSoftmax and NLLLoss and simply indexes the target's log-probability rather than materializing a one-hot vector, which is faster and more numerically stable.)
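
To make that concrete, here is a small self-contained check (not the repo's code; shapes and values are made up) showing that nn.CrossEntropyLoss on integer targets matches the explicit log-softmax / one-hot computation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical logits for 3 positions over 256 byte classes,
# with the "correct" next bytes given as plain integer indices.
logits = torch.randn(3, 256)
targets = torch.tensor([65, 10, 255])

# What nn.CrossEntropyLoss computes:
loss = nn.CrossEntropyLoss()(logits, targets)

# The same thing spelled out: log-softmax, then a dot product with the
# one-hot target (i.e. pick out the target's log-probability) and negate.
log_probs = F.log_softmax(logits, dim=-1)
one_hot = F.one_hot(targets, num_classes=256).float()
manual = -(one_hot * log_probs).sum(dim=-1).mean()

print(torch.allclose(loss, manual))  # True (up to floating-point error)
```
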

To answer question 2, the most common approach is to form the softmax distribution over the 256 byte values at each step, sample a byte from it, and feed that sampled byte index back through the model's embedding layer as the input for the next step. You don't form a weighted combination of embeddings; you commit to a concrete byte. Sampling usually produces better text than always taking the argmax, which tends to get stuck in repetitive loops.
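
Here is a sketch of what that generation loop might look like, assuming a model shaped like the one in your question (embedding layer inside the model, 256-way logits out). The function name, arguments, and defaults are just illustrative:

```python
import torch

@torch.no_grad()
def generate(model, prompt_bytes, num_bytes=100, temperature=1.0):
    model.eval()
    generated = list(prompt_bytes)
    inputs = torch.tensor([generated])                 # (1, len(prompt))
    logits, state = model(inputs)                      # warm up the LSTM state on the prompt
    for _ in range(num_bytes):
        # Softmax over the last position's logits, then sample (don't argmax).
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_byte = torch.multinomial(probs, num_samples=1)
        generated.append(next_byte.item())
        # Feed the *sampled byte index* back in; the model's own embedding
        # layer turns it into a dense vector -- no weighted embedding needed.
        logits, state = model(next_byte, state)
    return bytes(generated)
```

Lowering the temperature makes the samples more conservative; raising it makes them more diverse but noisier.
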