I used gensim to build a word2vec embedding of my corpus. Currently I'm converting my (padded) input sentences into word vectors with the gensim model, and these vectors are used as input for the model.

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(MAX_SEQUENCE_LENGTH, dim)))
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)
...
model.fit(training_sentences_vectors, training_labels, validation_data=validation_data)

Are there any drawbacks using the word vectors directly without a keras embedding layer?

I'm also currently adding additional (one-hot encoded) tags to the input tokens by concatenating them to each word vector, does this approach make sense?
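For reference, the concatenation described above can be sketched with toy numpy arrays (the shapes and tag values here are assumptions, not from the question):

```python
import numpy as np

# Toy shapes (assumptions): 2 sentences, 4 tokens each, 5-dim word vectors, 3 tag classes
word_vectors = np.random.rand(2, 4, 5)

# One integer tag per token, one-hot encoded by indexing into an identity matrix
tags = np.array([[0, 1, 2, 0], [2, 2, 1, 0]])
one_hot_tags = np.eye(3)[tags]                # shape (2, 4, 3)

# Concatenate along the feature axis, giving one longer vector per token
model_input = np.concatenate([word_vectors, one_hot_tags], axis=-1)
print(model_input.shape)                      # (2, 4, 8)
```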

2 Answers


In your current setup, the drawback is that your word vectors are not trainable, so you will not be able to fine-tune them for your task.

What I mean by this is that gensim has only learned a language model: it understands your corpus and its contents, but it knows nothing about the downstream task you are using Keras for. Your model's other weights will still adapt during training, but you will likely see better performance if you extract the embeddings from gensim, use them to initialize a Keras Embedding layer, and then feed in word indices instead of word vectors as your input.
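A minimal sketch of that initialization, using a toy stand-in dict in place of gensim's trained vectors (the words and values are made up for illustration):

```python
import numpy as np

# Stand-in for gensim's KeyedVectors: word -> vector (toy values, not from the question)
w2v = {"hello": np.array([0.1, 0.2]), "there": np.array([0.3, 0.4])}
dim = 2

# Reserve row 0 for the padding index; copy each word's vector into its own row
index2word = list(w2v)
embedding_matrix = np.zeros((len(w2v) + 1, dim))
for i, word in enumerate(index2word):
    embedding_matrix[i + 1] = w2v[word]

word2index = {word: i + 1 for i, word in enumerate(index2word)}

# In Keras, this matrix would seed a trainable Embedding layer, e.g.:
# Embedding(input_dim=embedding_matrix.shape[0], output_dim=dim,
#           weights=[embedding_matrix], trainable=True, mask_zero=True)
```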


There's an elegant way to do what you need.

The problems with your solution are that:

  1. the input is large: (batch_size, MAX_SEQUENCE_LENGTH, dim), and it may not fit in memory.
  2. you won't be able to train and update the word vectors for your task.

You can instead get away with just (batch_size, MAX_SEQUENCE_LENGTH). The Keras Embedding layer allows you to pass in a word index and get back a vector: 42 -> Embedding Layer -> [3, 5.2, ..., 33].
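Under the hood, that index-to-vector step is just a row selection from a weight matrix; a minimal numpy sketch (the sizes here are assumptions):

```python
import numpy as np

vocab_size, dim = 50, 4
weights = np.random.rand(vocab_size, dim)    # one row per word index

# A padded batch of word indices, shape (batch_size, MAX_SEQUENCE_LENGTH)
batch = np.array([[42, 7, 0], [3, 3, 0]])

# The embedding lookup: fancy indexing picks out one row per index
vectors = weights[batch]                     # shape (2, 3, 4)
```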

Conveniently, gensim's w2v model has a function get_keras_embedding which creates the needed embedding layer for you with the trained weights.

gensim_model = ...  # train it or load it
embedding_layer = gensim_model.wv.get_keras_embedding(train_embeddings=True)
embedding_layer.mask_zero = True  # No need for a masking layer

model = Sequential()
model.add(embedding_layer) # your embedding layer
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)

But you have to make sure the index of each word in your data matches the index in the word2vec model.

word2index = {}
for index, word in enumerate(gensim_model.wv.index2word):
    word2index[word] = index

Use the above word2index dictionary to convert your input data to the same indices as the gensim model. (Be careful: index 0 is also a real word in gensim's vocabulary, so padding and masking with 0 will hide that word as well.)

For example, your data might be:

X_train = [["hello", "there"], ["General", "Kenobi"]]

import numpy

new_X_train = []
for sent in X_train:
    temp_sent = []
    for word in sent:
        temp_sent.append(word2index[word])
    # Add the padding for each sentence. Here I am padding with 0
    temp_sent += [0] * (MAX_SEQUENCE_LENGTH - len(temp_sent))
    new_X_train.append(temp_sent)

X_train = numpy.asarray(new_X_train)

Now you can use X_train, and it will look like [[23, 34, 0, 0], [21, 63, 0, 0]]. The Embedding layer will map each index to its vector automatically and train the vectors if needed.

I think this is the best way of doing it, but I'll dig into how gensim wants it done and update this post if needed.