3
votes

Assuming that I have a word similarity score for each pair of words in two sentences, what is a decent approach to determining the overall sentence similarity from those scores?

The word scores are calculated using cosine similarity from vectors representing each word.

Now that I have individual word scores, is it too naive to sum the individual word scores and divide by the total word count of both sentences to get a score for the two sentences?
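Concretely, the naive approach would look something like this (a hypothetical sketch; `word_sim` stands in for my existing pairwise cosine scores):

```python
# Naive sentence score: sum every pairwise word score, then divide
# by the combined word count of both sentences.
# `word_sim(w1, w2)` is a hypothetical function returning the cosine
# similarity between the vectors for w1 and w2.

def naive_sentence_similarity(s1_words, s2_words, word_sim):
    total = sum(word_sim(w1, w2) for w1 in s1_words for w2 in s2_words)
    return total / (len(s1_words) + len(s2_words))
```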

I've read about constructing vectors to represent the sentences themselves, using the word scores, and then again using cosine similarity to compare the sentences. But I'm not familiar with how to construct sentence vectors from the existing word scores, nor am I aware of the tradeoffs compared with the naive approach described above, which, at the very least, I can easily comprehend. :)

Any insights are greatly appreciated.

Thanks.

By each pair of words, do you mean word1 in sentence A compared to word1 in sentence B, then compare word2 in A with word2 in B, etc.? Or is word1 in sentence A compared to each and every word in sentence B, then the same for word2 in sentence A, and so on? Do you do this on all words, or with stop words removed, or just nouns? – Darren Cook
I do it with all the words. So the word count of S1 times the word count of S2 is the total number of comparisons. – Scott Klarenbach

2 Answers

0
votes

What I ended up doing was taking the mean of each sentence's set of word vectors, and then applying cosine similarity to the two means, resulting in a score for the sentences.

I'm not sure how mathematically sound this approach is, but I've seen it done elsewhere (e.g., in Python's gensim).
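In code, the idea is just mean-pooling followed by cosine similarity. A minimal sketch, assuming `vecs_a` and `vecs_b` are 2-D numpy arrays holding one word vector per row:

```python
import numpy as np

def sentence_similarity(vecs_a, vecs_b):
    # Mean-pool each sentence's word vectors into a single sentence vector.
    mean_a = vecs_a.mean(axis=0)
    mean_b = vecs_b.mean(axis=0)
    # Cosine similarity between the two sentence vectors.
    return np.dot(mean_a, mean_b) / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b))
```

Mean-pooling throws away word order, but it's cheap and works reasonably well as a baseline.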

0
votes

You'd be better off using contextual word embeddings (vector representations) for the words.

Here is an approach to sentence similarity based on pairwise word similarities: BERTScore.

[Figure: BERTScore's pairwise-similarity matching pipeline]

You can check the math in the BERTScore paper (Zhang et al., "BERTScore: Evaluating Text Generation with BERT", ICLR 2020).
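The core computation is easy to sketch: build the pairwise cosine-similarity matrix between the two sentences' contextual token embeddings, greedily match each token to its best-scoring counterpart, and average the matches into precision, recall, and F1. A rough sketch (assuming `cand` and `ref` are 2-D arrays of contextual token embeddings, one row per token; the full method also adds optional IDF weighting):

```python
import numpy as np

def bertscore_f1(cand, ref):
    # L2-normalize rows so dot products are cosine similarities.
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T                  # pairwise similarity matrix (n_cand x n_ref)
    recall = sim.max(axis=0).mean()     # each reference token's best match
    precision = sim.max(axis=1).mean()  # each candidate token's best match
    return 2 * precision * recall / (precision + recall)
```

The authors also publish a ready-made `bert-score` package on PyPI if you'd rather not compute the embeddings yourself.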