
The problem to solve: given a sentence, return the intent behind it (think chatbot).

Reduced example dataset (the dict keys are the intents):

data_raw    = {"mk_reservation" : ["i want to make a reservation",
                                   "book a table for me"],
               "show_menu"      : ["what's the daily menu",
                                   "do you serve pizza"],
               "payment_method" : ["how can i pay",
                                   "can i use cash"],
               "schedule_info"  : ["when do you open",
                                   "at what time do you close"]}

I have cleaned up the sentences with spaCy and converted each word into a vector using the pretrained word2vec model loaded through the gensim library.
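In code, the preprocessing looks roughly like this (a sketch of what I mean; the spaCy model name and the exact token filtering are placeholders, not necessarily what I use verbatim):

import spacy
from gensim.models import KeyedVectors

# placeholder spaCy model; any English pipeline with stopword/punctuation flags works
nlp = spacy.load("en_core_web_sm")
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_to_vectors(sentence):
    # drop stopwords and punctuation, lowercase the rest
    doc = nlp(sentence)
    tokens = [t.text.lower() for t in doc if not (t.is_stop or t.is_punct)]
    # look up only the tokens the word2vec vocabulary actually contains
    return [w2v[t] for t in tokens if t in w2v]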

This is the result of running the sentences through the word2vec model GoogleNews-vectors-negative300.bin:

[[[ 5.99331968e-02  6.50703311e-02  5.03010787e-02 ... -8.00536275e-02
    1.94782894e-02 -1.83010306e-02]
  [-2.14406010e-02 -1.00447744e-01  6.13847338e-02 ... -6.72588721e-02
    3.03986594e-02 -4.14126664e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[ 4.48647663e-02 -1.03907576e-02 -1.78682189e-02 ...  3.84555124e-02
   -2.29179319e-02 -2.05144612e-03]
  [-5.39291985e-02 -9.88398306e-03  4.39085700e-02 ... -3.55276838e-02
   -3.66208404e-02 -4.57760505e-03]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]]
  • This is a list of sentences, and each sentence is a list of word vectors (shape: sentences × words × dimensions)
  • Each sentence is fixed at 10 words; shorter sentences are padded with zero vectors (see the padding sketch below)
  • Each word vector has 300 elements (the word2vec dimensionality)
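The zero-padding to a fixed 10 × 300 block per sentence can be done like this (a sketch reusing sentence_to_vectors from above; MAX_WORDS and DIM just name the sizes from the bullets):

import numpy as np

MAX_WORDS = 10   # fixed sentence length
DIM = 300        # word2vec dimensionality

def pad_sentence(vectors):
    # vectors: list of 300-d arrays for one sentence; pad/truncate to 10 rows
    padded = np.zeros((MAX_WORDS, DIM), dtype=np.float32)
    for i, vec in enumerate(vectors[:MAX_WORDS]):
        padded[i] = vec
    return padded

# X has shape (num_sentences, 10, 300)
X = np.stack([pad_sentence(sentence_to_vectors(s))
              for sents in data_raw.values() for s in sents])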

By following some tutorials I transformed this into a TensorDataset.
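Roughly like this (a sketch; the integer label encoding is my own choice here, not from any particular tutorial):

import torch
from torch.utils.data import TensorDataset, DataLoader

# map each intent name to an integer class id
intent_to_id = {intent: i for i, intent in enumerate(data_raw)}
y = [intent_to_id[intent] for intent, sents in data_raw.items() for s in sents]

dataset = TensorDataset(torch.tensor(X, dtype=torch.float32),
                        torch.tensor(y, dtype=torch.long))
loader  = DataLoader(dataset, batch_size=2, shuffle=True)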

At this point I am quite confused about how to use word2vec, and I suspect I have been wasting time. Right now I believe the embedding layer of an LSTM setup should be built by importing the word2vec model weights, like this:

import gensim
import torch
import torch.nn as nn

# binary=True is required for the .bin GoogleNews file
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file', binary=True)
weights = torch.FloatTensor(model.vectors)
word_embeddings = nn.Embedding.from_pretrained(weights)

This is not enough: PyTorch complains that the embedding layer only accepts integer (LongTensor) indices, not the float vectors themselves.

EDIT: I found out that importing the weight matrix from gensim's word2vec is not straightforward; one also has to import the word-to-index table.

As soon as I fix this issue I'll post the solution here.
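In the meantime, the direction I am exploring looks roughly like this (a sketch continuing the snippet above; key_to_index is the gensim 4 name of the word-to-index table, older versions expose a vocab dict instead):

# `model` and `word_embeddings` are the ones built in the snippet above
word_to_index = model.key_to_index   # word -> row number in the weight matrix

# nn.Embedding expects integer (LongTensor) indices, not the 300-d vectors
tokens  = ["book", "a", "table"]     # words taken from the example dataset
indices = torch.LongTensor([word_to_index[t] for t in tokens
                            if t in word_to_index])
vectors = word_embeddings(indices)   # shape: (number_of_known_tokens, 300)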


2 Answers

2 votes

You need neither a neural network nor word embeddings. Use parse trees with NLTK, where intents are verbs (V) acting on entities (N) in a given utterance:

Phrase
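A minimal illustration of the idea with NLTK's POS tagger (a sketch; a chunker or full parser would be the next step, and the tagger models need a one-time nltk.download):

import nltk
# one-time downloads, e.g. nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "i want to make a reservation"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

verbs = [w for w, tag in tagged if tag.startswith("VB")]   # candidate intent verbs
nouns = [w for w, tag in tagged if tag.startswith("NN")]   # candidate entities
print(verbs, nouns)   # typically something like ['want', 'make'] ['reservation']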

To classify a sentence you can then use a neural net. I personally like BERT in fast.ai. Once again, you won't need embeddings to run the classification, and you can do it in multiple languages:

Fast.ai_BERT_ULMFit

Also, if you are working on a chatbot, you can use Named Entity Recognition to guide the conversation.

1 vote

If you have enough training data, you may not need fancy neural networks (or even explicit word-vectorization). Just try basic text-classification algorithms (for example from scikit-learn) against basic text representations (such as a simple bag-of-words or bag-of-character n-grams).
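For instance, with the tiny data_raw dict from the question (a sketch; the vectorizer settings are just a reasonable starting point, not tuned values):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# flatten the intent dict into parallel lists of texts and labels
texts  = [s for sents in data_raw.values() for s in sents]
labels = [intent for intent, sents in data_raw.items() for s in sents]

# character n-grams tend to cope better with typos in user-typed queries
clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["can i book a table"]))   # hopefully ['mk_reservation']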

If those don't work, or fail when confronted with novel words, then you might try fancier text-vectorization options. For example, you might replace unknown words with the nearest known word from a large word2vec model. Or represent queries as averages of word-vectors, which is likely a better choice than creating giant zero-padded, fixed-length concatenations. Or use other algorithms for modeling the text, like 'Paragraph Vector' (Doc2Vec in gensim) or deeper neural-network modeling (which requires lots of data and training time).
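A sketch of the averages-of-word-vectors idea (kv stands for an already-loaded gensim KeyedVectors model):

import numpy as np

def avg_vector(tokens, kv, dim=300):
    # average the vectors of the tokens the model knows; unknown words are skipped
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

query_vec = avg_vector("book a table for me".split(), kv)   # one 300-d vector per query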

(If you have or can acquire lots of domain-specific training data, training word-vectors on that text will likely give you more appropriate word-vectors than reusing those from GoogleNews. Those vectors were trained on professional news stories from a corpus circa 2013, which will have a very different set of word-spellings and prominent word-senses than what seems to be your main interest, user-typed queries.)
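A sketch of training such domain-specific vectors with gensim (shown on the toy data only to illustrate the call; vector_size/epochs are placeholder values, and this only pays off with a real corpus of user queries):

from gensim.models import Word2Vec

# with real data this would be a large list of tokenized, in-domain queries
domain_sentences = [s.lower().split() for sents in data_raw.values() for s in sents]

model = Word2Vec(sentences=domain_sentences, vector_size=100, window=5,
                 min_count=1, epochs=50)                     # gensim >= 4 argument names
print(model.wv.most_similar("reservation", topn=3))          # meaningful only with real data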