I am trying to build an LSTM model for text generation. In Keras, my model would look something like the following:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# Map integer token ids to dense vectors of size embedding_dim
model.add(Embedding(vocab_size, embedding_dim))
# The LSTM input shape is inferred from the Embedding output, so input_shape is omitted
model.add(LSTM(units=embedding_dim, return_sequences=True))
model.add(LSTM(units=embedding_dim, return_sequences=True))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
I understand the benefits of an embedding layer for LSTM models: the input array takes less memory, similar tokens get mapped to nearby points in the latent space, and so on. This allows me to pass an array of integer categories directly to the model, without one-hot encoding it. Consider the following categorical dataset with vocab_size=9:
X= [ [1,2,3], [4,5,6], [7,8,9] ]
My input to the embedding layer would be the first two tokens of each sequence:
X= [ [1,2], [4,5], [7,8] ]
My question is about the shape of the target vector Y. With a categorical cross-entropy loss, I am still forced to one-hot encode Y. Specifically, I would need to one-hot encode the following next-token targets:
Y= [ [2,3], [5,6], [8,9] ]
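To make the setup concrete, here is a minimal NumPy sketch of how I derive the inputs and targets and what the one-hot encoded Y looks like. The names `data`, `X_in`, and `Y_onehot` are just for illustration, and shifting the 1-based tokens to 0-based indices before encoding is my own assumption.

```python
import numpy as np

vocab_size = 9

# The toy dataset from above: three sequences over the tokens 1..9.
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

X_in = data[:, :-1]  # inputs: every token except the last
Y = data[:, 1:]      # targets: the same sequences shifted by one step

# One-hot encode the targets by indexing rows of an identity matrix.
# Assumption: tokens are 1-based, so they are shifted to 0..8 first.
Y_onehot = np.eye(vocab_size, dtype=np.float32)[Y - 1]

print(X_in.shape, Y.shape, Y_onehot.shape)
```

The integer targets have shape (batch, timesteps), while the one-hot version gains an extra axis of length vocab_size.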
It seems strange to me that I can avoid one-hot encoding X but still need to one-hot encode Y. This runs counter to the memory arguments I have read for using an embedding layer: the one-hot encoded Y has one entry per timestep per vocabulary item, so it could become very large for a large vocab_size.
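As a rough back-of-the-envelope check of that memory concern, the sketch below compares the size of integer targets with one-hot targets. The helper `target_bytes`, the dtypes (int32 and float32), and the example sizes are all hypothetical, chosen only to illustrate the scaling.

```python
import numpy as np

def target_bytes(n_sequences, seq_len, vocab_size):
    """Return (integer_target_bytes, one_hot_target_bytes).

    Assumes int32 integer labels and float32 one-hot vectors.
    """
    n_labels = n_sequences * seq_len
    int_bytes = n_labels * np.dtype(np.int32).itemsize
    onehot_bytes = n_labels * vocab_size * np.dtype(np.float32).itemsize
    return int_bytes, onehot_bytes

# Hypothetical corpus: 10,000 sequences of 50 tokens, 20,000-word vocabulary.
int_bytes, onehot_bytes = target_bytes(10_000, 50, 20_000)
print(f"integer targets: {int_bytes / 1e6:.0f} MB")
print(f"one-hot targets: {onehot_bytes / 1e9:.0f} GB")
```

With equal-width dtypes, the one-hot array is larger than the integer array by a factor of vocab_size.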
Is my understanding that Y must be one-hot encoded correct, or are there other tricks I can use to avoid this?