2 votes

I wonder whether I have correctly understood the idea of using word embeddings in natural language processing. I want to show you how I perceive it and ask whether my interpretation is correct.

Let's assume that we want to predict whether a sentence is positive or negative. We will use a pre-trained word embedding prepared on a very large text corpus, with a dimension of 100. That means for each word we have 100 values. Our file looks like this:

...
    new -0.68538535 -0.08992791 0.8066535 other 97 values ... 
    man -0.6401568 -0.05007627 0.65864474 ...
    many 0.18335487 -0.10728102 0.468635 ...
    doesnt 0.0694685 -0.4131108 0.0052553082 ...
...
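
For reference, I imagine loading this file into a dictionary of word vectors roughly like this (the file name embeddings.txt is just a placeholder):

    import numpy as np

    def load_embeddings(path):
        # Each line: a word followed by its 100 whitespace-separated values.
        embeddings = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split()
                embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return embeddings

    # embeddings = load_embeddings("embeddings.txt")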

Obviously we have a train set and a test set. We will use a sklearn model to fit and predict the results. Our train set looks like this:

1 This is positive and very amazing sentence.
0 I feel very sad.

And the test set contains sentences like:

In my opinion people are amazing.

My doubts are mainly related to the preprocessing of the input data. I wonder whether it should be done in this way:

For all sentences we perform, for instance, tokenization, stop-word removal, lowercasing, etc. So for our example we get:

'this', 'is', 'positive', 'very', 'amazing', 'sentence'
'i', 'feel', 'very', 'sad'

'in', 'my', 'opinion', 'people', 'amazing'
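
A rough sketch of this preprocessing step (using NLTK here only as an example; the exact tokens kept depend on the tokenizer and stop-word list):

    # requires nltk.download('punkt') and nltk.download('stopwords')
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    STOP_WORDS = set(stopwords.words("english"))

    def preprocess(sentence):
        # Lowercase, tokenize, keep alphabetic tokens that are not stop words.
        tokens = word_tokenize(sentence.lower())
        return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

    print(preprocess("In my opinion people are amazing."))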

We map every word to an integer index (for example with a Keras Tokenizer):

1,2,3,4,5,6
7,8,4,9

10,11,12,13,5

Furthermore, we check the length of the longest sentence in both the train set and the test set. Let's assume that in our case the maximum length equals 10. We need all vectors to have the same length, so we fill the remaining fields with zeros (e.g. with pad_sequences):

1,2,3,4,5,6,0,0,0,0
7,8,4,9,0,0,0,0,0,0

10,11,12,13,5,0,0,0,0,0
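
I assume both the integer encoding and the zero padding could be done with the Keras preprocessing utilities, for example like this (the actual indices may differ from my numbers above, since the Tokenizer assigns them by word frequency):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    train_tokens = [['this', 'is', 'positive', 'very', 'amazing', 'sentence'],
                    ['i', 'feel', 'very', 'sad']]
    test_tokens = [['in', 'my', 'opinion', 'people', 'amazing']]

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(train_tokens + test_tokens)  # word -> integer index

    max_len = 10  # assumed length of the longest sentence, as above
    X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train_tokens),
                                maxlen=max_len, padding='post')
    X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test_tokens),
                               maxlen=max_len, padding='post')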

Now the biggest doubt: we assign the values from our word2vec embedding file to all the words in the prepared vectors of the training set and the test set.

Our word2vec embedding file looks like this:

...
    in -0.039903056 0.46479827 0.2576446 ...
    ...
    opinion 0.237968 0.17199863 -0.23182874...
    ...
    people 0.2037858 -0.29881874 0.12108547 ...
    ...
    amazing 0.20736384 0.22415389 0.09953516 ...
    ...
    my 0.46468195 -0.35753986 0.6069699 ...
...

And, for instance, for 'in', 'my', 'opinion', 'people', 'amazing', encoded as 10,11,12,13,5,0,0,0,0,0, we get a list of lists like this: [-0.039903056 0.46479827 0.2576446 ...],[0.46468195 -0.35753986 0.6069699 ...],[0.237968 0.17199863 -0.23182874 ...],[0.2037858 -0.29881874 0.12108547 ...],[0.20736384 0.22415389 0.09953516 ...],0,0,0,0,0
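
This lookup is what I imagine for each padded sequence (assuming `embeddings` is the dictionary loaded earlier and `tokenizer.index_word` maps indices back to words; I map the padding index 0 to an all-zero vector so that every position has 100 values):

    import numpy as np

    EMBEDDING_DIM = 100

    def sequence_to_vectors(seq, tokenizer, embeddings):
        # Replace each word index by its embedding vector; use zeros for the
        # padding index 0 and for words missing from the embedding file.
        vectors = []
        for idx in seq:
            word = tokenizer.index_word.get(idx)
            if word is not None and word in embeddings:
                vectors.append(embeddings[word])
            else:
                vectors.append(np.zeros(EMBEDDING_DIM, dtype=np.float32))
        return np.array(vectors)  # shape (max_len, 100)

    # X_train = np.array([sequence_to_vectors(s, tokenizer, embeddings) for s in X_train_seq])
    # -> shape (n_sentences, max_len, 100), i.e. a 3D array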

Finally, our train set looks like this:

y             x
1 [100 values],[100 values],[100 values],[...],[...],[...],0,0,0,0
0 [...],[...],[...],[...],0,0,0,0,0,0
1 [...],[...],[...],[...],[...],0,0,0,0,0
 ...

And test set looks in this way:

x
[100 values],[...],[...],[...],[...],0,0,0,0,0
 ...

In the last step we train our model, using for example a sklearn model:

    model = LogisticRegression()
    model.fit(values from the x column of the train set, values from the y column of the train set)

Then we predict on the test data, using the same fitted model:

    model.predict(values from the x column of the test set)

Above I described the whole process, with the specific steps that give me the most doubts. I am asking you to point out the mistakes I have made in my reasoning and to explain them. I want to be sure that I understood it correctly. Thank you in advance for your help.

You first arrange the features (the different unique words) in columns (perhaps in alphabetical order), like a bag of words, and then for each sample fill in the appropriate columns. That way you will not get all the padded zeros at the end; instead, zeros appear in the positions of the features that are not present in that sample. – Vivek Kumar

1 Answer

3 votes

Logistic regression accepts a flat 2D matrix as its X input, but you are trying to feed it a strange ragged structure; it will not work.

I would suggest a simpler solution: just use the average of the embeddings of the words in a sentence as the input for logistic regression. In that case the input has a regular shape and is relatively small. If you want to improve on this, you can make the average weighted (e.g. by TF-IDF).
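
A minimal sketch of that idea, assuming `embeddings` is a dict mapping words to 100-dimensional vectors and `train_tokens`/`test_tokens` are the tokenized sentences from your question:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    EMBEDDING_DIM = 100

    def sentence_vector(tokens, embeddings):
        # Average the vectors of the words that exist in the embedding vocabulary.
        vectors = [embeddings[w] for w in tokens if w in embeddings]
        if not vectors:
            return np.zeros(EMBEDDING_DIM)  # sentence with no known words
        return np.mean(vectors, axis=0)     # one 100-dimensional vector per sentence

    X_train = np.array([sentence_vector(t, embeddings) for t in train_tokens])
    y_train = np.array([1, 0])
    X_test = np.array([sentence_vector(t, embeddings) for t in test_tokens])

    clf = LogisticRegression()
    clf.fit(X_train, y_train)        # X_train is a regular 2D matrix (n_sentences, 100)
    predictions = clf.predict(X_test)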

If you want to keep modelling sentences as sequences of embeddings, you need a more complex model than logistic regression, e.g. a recurrent neural network.
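
For completeness, a very rough sketch of such a model with Keras, consuming the padded integer sequences directly (layer sizes here are arbitrary, `tokenizer` and `X_train_seq` are assumed from the preprocessing above, and you would normally initialise the Embedding layer with your pre-trained vectors instead of training it from scratch):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0

    model = Sequential([
        Embedding(vocab_size, 100),   # could be given the pre-trained embedding matrix as weights
        LSTM(32),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    # model.fit(X_train_seq, y_train, epochs=10)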