0 votes

I am new to NLP. How do I find the similarity between two sentences, and how do I print a score for each word? Also, how do I implement this with the gensim Word2Vec model?

Here is the code I tried; these are my two sentences:

sentence1="I am going to India"
sentence2=" I am going to Bharat"
from gensim.models import word2vec
import numpy as np


words1 = sentence1.split(' ')
words2 = sentence2.split(' ')

#The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

#Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning)/(np.linalg.norm(sentence1_meaning)*np.linalg.norm(sentence2_meaning))
It seems that your code is missing an important step: the word2vec model should be either trained from scratch or loaded from some file. Why don't you start with a tutorial on Gensim? radimrehurek.com/gensim/models/word2vec.html – David Dale

2 Answers

1 vote

You can train the model and use the similarity function to get the cosine similarity between two words.

Here's a simple demo:

from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # tiny toy corpus bundled with gensim

# Train a Word2Vec model on the toy corpus.
# Note: in gensim 4.x the dimensionality argument is vector_size (it was size in gensim 3.x).
model = Word2Vec(common_texts,
                 vector_size = 500,
                 window = 5,
                 min_count = 1,
                 workers = 4)

# The trained word vectors live in model.wv (a KeyedVectors object).
word_vectors = model.wv

word_vectors.similarity('computer', 'computer')

The output will be 1.0, of course, which indicates 100% similarity.
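Since the original question is about comparing whole sentences, one simple extension (just a sketch, assuming the toy model trained above) is KeyedVectors.n_similarity, which averages the word vectors of each token list and returns the cosine similarity between the two averages:

# Sentence similarity as the cosine between the averaged word vectors of each
# token list; all words below appear in common_texts, so they are in the
# model's vocabulary.
sentence1 = ['human', 'interface', 'computer']
sentence2 = ['user', 'computer', 'system']
print(word_vectors.n_similarity(sentence1, sentence2))

This is essentially the averaging idea from the question, but with the vectors looked up in a trained model.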

0 votes

After your from gensim.models import word2vec, the name word2vec refers to a Python module – not a function that you can call as word2vec(words1[0]) or word2vec(w).

So your code isn't even close to approaching this correctly, and you should review docs/tutorials which demonstrate the proper use of the gensim Word2Vec class & supporting methods, then mimic those.

As @david-dale mentions, there's a basic intro in the gensim docs for Word2Vec:

https://radimrehurek.com/gensim/models/word2vec.html

The gensim library also bundles within its docs/notebooks directory a number of Jupyter notebooks demonstrating various algorithms & techniques. The notebook word2vec.ipynb shows basic Word2Vec usage; you can also view it via the project's source code repository at...

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

...however, it's really best to run it as a local notebook, so you can step through the execution cell by cell and try different variants yourself, perhaps even adapting it to use your own data instead.

When you reach that level, note that:

  • these models require far more than just a few sentences of training data, so ideally you'd either have (a) many sentences from the same domain as those you're comparing, so that the model can learn words in those contexts, or (b) a model trained on a compatible corpus, which you then apply to your out-of-corpus sentences (a sketch of loading such a pretrained model appears after this list).

  • using the average of all the word-vectors in a sentence is just one relatively simple way to make a vector for a longer text; there are many more sophisticated alternatives. One that is very similar to Word2Vec is the 'Paragraph Vector' algorithm, available in gensim as the class Doc2Vec (a minimal sketch follows below).
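For point (b) above, one way to obtain a pretrained model is gensim's downloader API. This is only a minimal sketch; 'glove-wiki-gigaword-50' is one of the smaller models the downloader offers, and the file is fetched the first time it is requested:

import gensim.downloader as api

# Download (on first use) and load a small set of pretrained word vectors;
# api.load() returns a KeyedVectors object, so the same similarity methods
# shown in the other answer are available on it.
word_vectors = api.load('glove-wiki-gigaword-50')

# Word-level similarity (this model uses lowercase tokens).
print(word_vectors.similarity('india', 'delhi'))

# Sentence-level similarity via averaged word vectors; filter against the
# vocabulary first, since out-of-vocabulary words cannot be looked up.
words1 = [w for w in 'i am going to india'.split() if w in word_vectors]
words2 = [w for w in 'i am going to bharat'.split() if w in word_vectors]
print(word_vectors.n_similarity(words1, words2))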
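And for the Doc2Vec alternative mentioned in the last point, here is a minimal sketch of the API shape, again using gensim's toy common_texts corpus purely for illustration (a real application would need a much larger training corpus, as noted above):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts

# Doc2Vec expects TaggedDocument objects: a list of tokens plus a tag per document.
documents = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# infer_vector() builds a vector for an unseen, pre-tokenized text.
vec1 = model.infer_vector(['human', 'computer', 'interface'])
vec2 = model.infer_vector(['user', 'system', 'response'])

# Cosine similarity between the two inferred document vectors.
similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(similarity)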