There's an elegant way to do what you need.
The problem with your solution is that:
- the input becomes huge: (batch_size, MAX_SEQUENCE_LENGTH, dim), which may not fit in memory (rough numbers in the sketch below);
- you won't be able to train and update the word vectors, which your task requires.
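To put rough numbers on the first point, here is a back-of-the-envelope sketch; the sentence count, sequence length, and vector size below are all made-up example values:

num_sentences = 100000
MAX_SEQUENCE_LENGTH = 100
dim = 300  # example word2vec vector size

# Pre-embedded input: every token carries a full float32 vector
print(num_sentences * MAX_SEQUENCE_LENGTH * dim * 4 / 1e9)  # ~12 GB

# Index input: every token is a single int32
print(num_sentences * MAX_SEQUENCE_LENGTH * 4 / 1e9)  # ~0.04 GB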
You can instead get away with just (batch_size, MAX_SEQUENCE_LENGTH). The Keras Embedding layer lets you pass in a word index and get back its vector: 42 -> Embedding layer -> [3, 5.2, ..., 33].
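Just to illustrate that mapping, here is a tiny standalone sketch with a made-up vocabulary size and dimensionality (the gensim-backed version follows right below):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Toy embedding: 1000 possible word indices, each mapped to a 300-d vector
demo = Sequential([Embedding(input_dim=1000, output_dim=300)])
vectors = demo.predict(np.array([[42]]))  # feed in the word index 42
print(vectors.shape)  # (1, 1, 300): one sentence, one token, one 300-d vector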
Conveniently, gensim's word2vec keyed vectors have a method get_keras_embedding which creates the needed Embedding layer for you, initialized with the trained weights.
from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM

gensim_model = Word2Vec.load("word2vec.model")  # train it or load it ("word2vec.model" is just a placeholder path)
embedding_layer = gensim_model.wv.get_keras_embedding(train_embeddings=True)
embedding_layer.mask_zero = True  # mask the padded zeros, no need for a separate Masking layer

num_lstm = 128  # number of LSTM units, pick whatever fits your task
model = Sequential()
model.add(embedding_layer)  # your embedding layer
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True)
))
But you have to make sure that a word's index in your data is the same as that word's index in the word2vec model's vocabulary.
word2index = {}
for index, word in enumerate(gensim_model.wv.index2word):
    word2index[word] = index
Use the above word2index dictionary to convert the words in your input data to the indices the gensim model expects.
For example, your data might be:
X_train = [["hello", "there"], ["General", "Kenobi"]]
import numpy

new_X_train = []
for sent in X_train:
    temp_sent = []
    for word in sent:
        temp_sent.append(word2index[word])
    # Add the padding for each sentence. Here I am padding with 0
    temp_sent += [0] * (MAX_SEQUENCE_LENGTH - len(temp_sent))
    new_X_train.append(temp_sent)

X_train = numpy.asarray(new_X_train)
Now you can use X_train, and it will look like: [[23, 34, 0, 0], [21, 63, 0, 0]] (padded to MAX_SEQUENCE_LENGTH = 4 in this toy example).
The Embedding layer will map each index to its word vector automatically, and train the vectors if needed.
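For completeness, here is a rough sketch of how you could train on the padded index matrix; the extra LSTM head, the sigmoid output, the loss, and y_train are assumptions about your task rather than part of your setup:

from keras.layers import Dense, LSTM

model.add(LSTM(64))                        # collapse the sequence; 64 units is an arbitrary choice
model.add(Dense(1, activation="sigmoid"))  # assumes a binary label per sentence
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(X_train, y_train, batch_size=32, epochs=5)  # y_train is your (assumed) label array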
I think this is the best way of doing it but I'll dig into how gensim wants it to be done and update this post if needed.