I have a list of tagged sentences. I transformed each of them in the following way:
- For each word, get the corresponding one-hot encoding (a vector of dimension input_dim);
- Insert a pre-padding as explained in the example below;
- Split each sentence into len(sentence) sub-sentences, using a window of size time_steps (to get the context for the prediction of the next word); a rough code sketch of this transformation follows the example below.
For example, using time_steps=2, a single sentence ["this", "is", "an", "example"] is transformed into:
[
[one_hot_enc("empty_word"), one_hot_enc("empty_word")],
[one_hot_enc("empty_word"), one_hot_enc("this")],
[one_hot_enc("this"), one_hot_enc("is")],
[one_hot_enc("is"), one_hot_enc("an")],
]
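To make the transformation concrete, here is a minimal sketch of how I build the windows (names like one_hot_enc and sentence_to_windows are just illustrative, not my actual code):

import numpy as np

# Toy vocabulary just for illustration; in my real code it is much larger.
vocab = ["empty_word", "this", "is", "an", "example"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
input_dim = len(vocab)

def one_hot_enc(word):
    # One-hot vector of dimension input_dim
    vec = np.zeros(input_dim)
    vec[word_to_idx[word]] = 1.0
    return vec

def sentence_to_windows(sentence, time_steps):
    # Pre-pad with "empty_word" so the first window has no real context
    padded = ["empty_word"] * time_steps + sentence
    windows = []
    for i in range(len(sentence)):
        window = padded[i:i + time_steps]
        windows.append([one_hot_enc(w) for w in window])
    return windows

windows = sentence_to_windows(["this", "is", "an", "example"], time_steps=2)
print(len(windows), len(windows[0]), windows[0][0].shape)  # 4 2 (5,)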
At the end, considering all the sub-sentences as a single list, the shape of the training data X_train is (num_samples, time_steps, input_dim), where:
- input_dim: the size of my vocabulary;
- time_steps: the length of the sequence fed into the LSTM;
- num_samples: the number of samples (sub-sentences).
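As a sketch of how I stack everything (again illustrative only, reusing the hypothetical sentence_to_windows helper from above):

sentences = [["this", "is", "an", "example"],
             ["this", "is", "an"]]  # toy data, just to check the shape

all_windows = []
for s in sentences:
    all_windows.extend(sentence_to_windows(s, time_steps=2))

X_train = np.array(all_windows)
print(X_train.shape)  # (num_samples, time_steps, input_dim) -> (7, 2, 5)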
Now, I want to use an Embedding layer, in order to map each word to a smaller, continuous dimensional space, and an LSTM layer, in which I use the contexts built as above.
I tried something like this:
from keras.models import Sequential
from keras.layers import InputLayer, Embedding, LSTM, Dense, Activation

model = Sequential()
model.add(InputLayer(input_shape=(time_steps, input_dim)))
model.add(Embedding(input_dim, embedding_size, input_length=time_steps))
model.add(LSTM(32))
model.add(Dense(output_dim))
model.add(Activation('softmax'))
But it gives me the following error:
ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=4
What am I missing? Is there some logical error in what I'm trying to do?