
The problem to solve: given a sentence, return the intent behind it (think chatbot).

Reduced example dataset (the dict keys are the intents):

data_raw    = {"mk_reservation" : ["i want to make a reservation",
                                   "book a table for me"],
               "show_menu"      : ["what's the daily menu",
                                   "do you serve pizza"],
               "payment_method" : ["how can i pay",
                                   "can i use cash"],
               "schedule_info"  : ["when do you open",
                                   "at what time do you close"]}

I have cleaned up the sentences with spaCy and converted each word into a vector using the pretrained word2vec model loaded through the gensim library.
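In code, the preprocessing looks roughly like this (a sketch of what I mean; the spaCy model name and the exact token filtering are placeholders, not necessarily what I use verbatim):

import spacy
from gensim.models import KeyedVectors

# placeholder spaCy model; any English pipeline with stopword/punctuation flags works
nlp = spacy.load("en_core_web_sm")
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_to_vectors(sentence):
    # drop stopwords and punctuation, lowercase the rest
    doc = nlp(sentence)
    tokens = [t.text.lower() for t in doc if not (t.is_stop or t.is_punct)]
    # look up only the tokens the word2vec vocabulary actually contains
    return [w2v[t] for t in tokens if t in w2v]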

This is the result of running the sentences through the word2vec model GoogleNews-vectors-negative300.bin:

[[[ 5.99331968e-02  6.50703311e-02  5.03010787e-02 ... -8.00536275e-02
    1.94782894e-02 -1.83010306e-02]
  [-2.14406010e-02 -1.00447744e-01  6.13847338e-02 ... -6.72588721e-02
    3.03986594e-02 -4.14126664e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[ 4.48647663e-02 -1.03907576e-02 -1.78682189e-02 ...  3.84555124e-02
   -2.29179319e-02 -2.05144612e-03]
  [-5.39291985e-02 -9.88398306e-03  4.39085700e-02 ... -3.55276838e-02
   -3.66208404e-02 -4.57760505e-03]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]]
  • This is a list of sentences, and each sentence is a list of word vectors (shape: sentences × words × dimensions)
  • Each sentence is fixed at 10 words; shorter sentences are padded with zero vectors (see the padding sketch below)
  • Each word vector has 300 elements (the word2vec dimensionality)
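The zero-padding to a fixed 10 × 300 block per sentence can be done like this (a sketch reusing sentence_to_vectors from above; MAX_WORDS and DIM just name the sizes from the bullets):

import numpy as np

MAX_WORDS = 10   # fixed sentence length
DIM = 300        # word2vec dimensionality

def pad_sentence(vectors):
    # vectors: list of 300-d arrays for one sentence; pad/truncate to 10 rows
    padded = np.zeros((MAX_WORDS, DIM), dtype=np.float32)
    for i, vec in enumerate(vectors[:MAX_WORDS]):
        padded[i] = vec
    return padded

# X has shape (num_sentences, 10, 300)
X = np.stack([pad_sentence(sentence_to_vectors(s))
              for sents in data_raw.values() for s in sents])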

By following some tutorials I transformed this into a TensorDataset.
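Roughly like this (a sketch; the integer label encoding is my own choice here, not from any particular tutorial):

import torch
from torch.utils.data import TensorDataset, DataLoader

# map each intent name to an integer class id
intent_to_id = {intent: i for i, intent in enumerate(data_raw)}
y = [intent_to_id[intent] for intent, sents in data_raw.items() for s in sents]

dataset = TensorDataset(torch.tensor(X, dtype=torch.float32),
                        torch.tensor(y, dtype=torch.long))
loader  = DataLoader(dataset, batch_size=2, shuffle=True)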

At this point I am quite confused about how to use word2vec, and I suspect I have been wasting time. Right now I believe the embedding layer of an LSTM setup should be built by importing the word2vec model weights, like this:

import gensim
import torch
import torch.nn as nn

# binary=True is required for the .bin GoogleNews file
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file', binary=True)
weights = torch.FloatTensor(model.vectors)
word_embeddings = nn.Embedding.from_pretrained(weights)

This is not enough: PyTorch complains that the embedding layer only accepts integer (LongTensor) indices, not the float vectors themselves.

EDIT: I found out that importing the weight matrix from gensim's word2vec is not straightforward; one also has to import the word-to-index table.

As soon as I fix this issue I'll post the solution here.
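In the meantime, the direction I am exploring looks roughly like this (a sketch continuing the snippet above; key_to_index is the gensim 4 name of the word-to-index table, older versions expose a vocab dict instead):

# `model` and `word_embeddings` are the ones built in the snippet above
word_to_index = model.key_to_index   # word -> row number in the weight matrix

# nn.Embedding expects integer (LongTensor) indices, not the 300-d vectors
tokens  = ["book", "a", "table"]     # words taken from the example dataset
indices = torch.LongTensor([word_to_index[t] for t in tokens
                            if t in word_to_index])
vectors = word_embeddings(indices)   # shape: (number_of_known_tokens, 300)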


2 Answers

2 votes

You need neither a neural network nor word embeddings. Use parse trees with NLTK, where intents are verbs (V) acting on entities (N) in a given utterance:

Phrase
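A minimal illustration of the idea with NLTK's POS tagger (a sketch; a chunker or full parser would be the next step, and the tagger models need a one-time nltk.download):

import nltk
# one-time downloads, e.g. nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "i want to make a reservation"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

verbs = [w for w, tag in tagged if tag.startswith("VB")]   # candidate intent verbs
nouns = [w for w, tag in tagged if tag.startswith("NN")]   # candidate entities
print(verbs, nouns)   # typically something like ['want', 'make'] ['reservation']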

To classify a sentence you can then use a neural net. I personally like BERT in fast.ai. Once again, you won't need embeddings to run the classification, and you can do it in multiple languages:

Fast.ai_BERT_ULMFit

Also, if you are working on a chatbot, you can use Named Entity Recognition to guide the conversation.

1 vote

If you have enough training data, you may not need fancy neural networks (or even explicit word-vectorization). Just try basic text-classification algorithms (for example from scikit-learn) against basic text representations (such as a simple bag-of-words or bag-of-character n-grams).
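For instance, with the tiny data_raw dict from the question (a sketch; the vectorizer settings are just a reasonable starting point, not tuned values):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# flatten the intent dict into parallel lists of texts and labels
texts  = [s for sents in data_raw.values() for s in sents]
labels = [intent for intent, sents in data_raw.items() for s in sents]

# character n-grams tend to cope better with typos in user-typed queries
clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["can i book a table"]))   # hopefully ['mk_reservation']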

If those don't work, or fail when confronted with novel words, then you might try fancier text-vectorization options. For example, you might replace unknown words with the nearest known word from a large word2vec model. Or represent queries as averages of word-vectors, which is likely a better choice than creating giant zero-padded, fixed-length concatenations. Or use other algorithms for modeling the text, like 'Paragraph Vector' (Doc2Vec in gensim) or deeper neural-network modeling (which requires lots of data and training time).
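A sketch of the averages-of-word-vectors idea (kv stands for an already-loaded gensim KeyedVectors model):

import numpy as np

def avg_vector(tokens, kv, dim=300):
    # average the vectors of the tokens the model knows; unknown words are skipped
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

query_vec = avg_vector("book a table for me".split(), kv)   # one 300-d vector per query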

(If you have or can acquire lots of domain-specific training data, training word-vectors on that text will likely give you more appropriate word-vectors than reusing those from GoogleNews. Those vectors were trained on professional news stories from a corpus circa 2013, which will have a very different set of word-spellings and prominent word-senses than what seems to be your main interest, user-typed queries.)
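A sketch of training such domain-specific vectors with gensim (shown on the toy data only to illustrate the call; vector_size/epochs are placeholder values, and this only pays off with a real corpus of user queries):

from gensim.models import Word2Vec

# with real data this would be a large list of tokenized, in-domain queries
domain_sentences = [s.lower().split() for sents in data_raw.values() for s in sents]

model = Word2Vec(sentences=domain_sentences, vector_size=100, window=5,
                 min_count=1, epochs=50)                     # gensim >= 4 argument names
print(model.wv.most_similar("reservation", topn=3))          # meaningful only with real data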