The problem to solve: given a sentence, return the intent behind it (think of a chatbot).
Reduced example dataset (the intent is the dictionary key):
data_raw = {"mk_reservation": ["i want to make a reservation",
                               "book a table for me"],
            "show_menu": ["what's the daily menu",
                          "do you serve pizza"],
            "payment_method": ["how can i pay",
                               "can i use cash"],
            "schedule_info": ["when do you open",
                              "at what time do you close"]}
I have stripped down the sentences with spaCy, and converted each word to its vector using the pretrained word2vec model loaded through the gensim library.
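Roughly, that preprocessing looked like this (a minimal sketch, assuming spaCy's en_core_web_sm model, stop-word/punctuation stripping, and a local copy of the GoogleNews vectors; the helper name sentence_to_vectors is mine):

import numpy as np
import spacy
import gensim

nlp = spacy.load("en_core_web_sm")
w2v = gensim.models.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_to_vectors(sentence):
    # Strip stop words and punctuation with spaCy, then look up each
    # remaining token in the pretrained word2vec vocabulary.
    tokens = [t.text for t in nlp(sentence) if not t.is_stop and not t.is_punct]
    return np.array([w2v[t] for t in tokens if t in w2v])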
This is what resulted from using the word2vec model GoogleNews-vectors-negative300.bin:
[[[ 5.99331968e-02  6.50703311e-02  5.03010787e-02 ... -8.00536275e-02
    1.94782894e-02 -1.83010306e-02]
  [-2.14406010e-02 -1.00447744e-01  6.13847338e-02 ... -6.72588721e-02
    3.03986594e-02 -4.14126664e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[ 4.48647663e-02 -1.03907576e-02 -1.78682189e-02 ...  3.84555124e-02
   -2.29179319e-02 -2.05144612e-03]
  [-5.39291985e-02 -9.88398306e-03  4.39085700e-02 ... -3.55276838e-02
   -3.66208404e-02 -4.57760505e-03]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]]
- This is a list of sentences; each sentence is a list of words, and each word is a list of floats (shape: [sentences][words][dimensions])
- Each sentence (list) must be 10 words long (I pad the remaining positions with zero vectors; see the sketch after this list)
- Each word (list) has 300 elements (the word2vec dimensions)
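A minimal sketch of that padding step, assuming each sentence arrives as an (n_words, 300) NumPy array (the function name pad_sentence is mine):

import numpy as np

MAX_LEN, EMB_DIM = 10, 300

def pad_sentence(vectors):
    # Copy up to 10 word vectors into a fixed-size (10, 300) block;
    # any remaining rows stay as zero vectors.
    padded = np.zeros((MAX_LEN, EMB_DIM), dtype=np.float32)
    n = min(len(vectors), MAX_LEN)
    if n:
        padded[:n] = vectors[:n]
    return padded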
By following some tutorials, I transformed this into a TensorDataset.
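Roughly like this (a sketch reusing data_raw and the helpers above; mapping each intent to an integer class id is my assumption about the label encoding):

import numpy as np
import torch
from torch.utils.data import TensorDataset

intents = list(data_raw.keys())   # ["mk_reservation", "show_menu", ...]
X, y = [], []
for label, sentences in data_raw.items():
    for s in sentences:
        X.append(pad_sentence(sentence_to_vectors(s)))
        y.append(intents.index(label))

dataset = TensorDataset(torch.tensor(np.array(X)),          # (8, 10, 300) floats
                        torch.tensor(y, dtype=torch.long))  # (8,) class ids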
At this point I am very confused about how to use word2vec here, and I have probably been wasting time. As of now, I believe the embedding layer of an LSTM configuration should be built by importing the word2vec model weights:
import gensim
import torch
import torch.nn as nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file', binary=True)  # binary=True for the .bin format
weights = torch.FloatTensor(model.vectors)
word_embeddings = nn.Embedding.from_pretrained(weights)
This is not enough, as PyTorch complains that the embedding layer only accepts indices of integer (Long) type.
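As far as I can tell, the reason is that nn.Embedding is a lookup table: its input must be a LongTensor of token indices, not the 300-dimensional float vectors themselves. So either the float tensors go straight into the LSTM (skipping nn.Embedding entirely), or the sentences are kept as index sequences and nn.Embedding does the lookup. A minimal sketch of the index-based route, assuming gensim >= 4 (where the word-to-index table is exposed as key_to_index):

import gensim
import torch
import torch.nn as nn

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
weights = torch.FloatTensor(model.vectors)
embedding = nn.Embedding.from_pretrained(weights)

word2idx = model.key_to_index   # the word-to-index table (gensim >= 4)
ids = torch.LongTensor([word2idx[w] for w in ["book", "a", "table"]
                        if w in word2idx])
vectors = embedding(ids)        # (seq_len, 300) floats, ready for the LSTM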
EDIT: I found out that importing the weight matrix from gensim's word2vec is not straightforward; one has to import the word-to-index table as well.
As soon as I fix this issue, I'll post the solution here.