Tf-Idf using cosine similarity for document similarity of almost similar sentence

Question

I am using TF-IDF with cosine similarity to calculate the similarity between descriptions (sentences).

Input string:

    3/4x1/2x3/4 blk mi tee

Below are the sentences among which I need to find the one most similar to the input string:

      smith-cooper® 33rt1 reducing pipe tee 3/4 x 1/2 x 3/4 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 1 x 1/2 x 3/4 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 1-1/4 x 1 x 3/4 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 1-1/2 x 3/4 x 1-1/2 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 1-1/2 x 1-1/4 x 1 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 2 x 2 x 3/4 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 2 x 1-1/2 x 1-1/4 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 2-1/2 x 2 x 2 in npt 150 lb malleable iron black
      smith-cooper® 33rt1 reducing pipe tee 3 x 3 x 2 in npt 150 lb malleable iron black

Because the sentences are almost identical, I am using the TF-IDF approach, which gives a low score to words that appear in every document (the IDF part) and a higher score to rare words, making it easier to find the most similar document.

Is there any approach that works better than this?

Taylor Wood · Accepted Answer · 2017-10-19T15:43:15

There are certainly other approaches, such as latent semantic analysis, but what works best depends entirely on your data/corpus. In my experience, TF-IDF is a good starting point. More sophisticated approaches may underperform TF-IDF, or provide a negligible improvement relative to their complexity.

One thing to experiment with under TF-IDF is the n-gram size, along with other pre-processing strategies for your corpus. Given your example, splitting on word boundaries may not be what you want: it can be better to treat some sentence components as a single term, e.g. 3/4 x 1/2 x 3/4. I'd experiment with different n-gram sizes first.
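As a rough illustration of the n-gram idea, here is a minimal standard-library sketch that builds TF-IDF vectors over character n-grams and ranks the candidates by cosine similarity. The function names, the 2-to-4 n-gram range, the smoothed IDF formula, and the decision to strip whitespace (so that `3/4x1/2x3/4` and `3/4 x 1/2 x 3/4` yield the same n-grams) are all assumptions made for this example, not something taken from the question. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would likely be the more convenient tool.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Character n-grams of a lowercased string with all whitespace removed."""
    s = "".join(text.lower().split())
    return [s[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    df = Counter(t for d in docs for t in set(char_ngrams(d)))
    n = len(docs)
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(text, idf):
    """TF-IDF vector as a dict; n-grams unseen in the corpus are ignored."""
    tf = Counter(char_ngrams(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return (score, document) pairs, most similar first."""
    idf = fit_idf(docs)
    q = tfidf(query, idf)
    return sorted(((cosine(q, tfidf(d, idf)), d) for d in docs), reverse=True)

# Candidate descriptions rebuilt from the dimension strings in the question.
DIMS = ["3/4 x 1/2 x 3/4", "1 x 1/2 x 3/4", "1-1/4 x 1 x 3/4",
        "1-1/2 x 3/4 x 1-1/2", "1-1/2 x 1-1/4 x 1", "2 x 2 x 3/4",
        "2 x 1-1/2 x 1-1/4", "2-1/2 x 2 x 2", "3 x 3 x 2"]
CANDIDATES = [f"smith-cooper® 33rt1 reducing pipe tee {d} in npt 150 lb "
              "malleable iron black" for d in DIMS]
QUERY = "3/4x1/2x3/4 blk mi tee"

if __name__ == "__main__":
    for score, doc in rank(QUERY, CANDIDATES)[:3]:
        print(f"{score:.3f}  {doc}")
```

Changing `n_min` and `n_max` in `char_ngrams` is the simplest way to try different n-gram sizes on your own data.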

In your example, the sentences are identical except for the measurements/dimensions. If this sample is representative, you may want to put more thought into how to measure distances between those measurements.
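One way to act on that is to pull the dimensions out of each description and compare them numerically instead of as text. The sketch below is only an assumption about how this might look: the regular expression, the helper names, and the choice of summed absolute difference as the distance are illustrative, and it assumes every description contains exactly one `A x B x C` size triple written as whole numbers, fractions, or mixed numbers like `1-1/2`.

```python
import re
from fractions import Fraction

# One size: a mixed number ("1-1/2"), a fraction ("3/4"), or a whole number ("2").
SIZE = r"(?:(?:\d+-)?\d+/\d+|\d+)"
TRIPLE = re.compile(rf"({SIZE})\s*x\s*({SIZE})\s*x\s*({SIZE})")

def parse_size(token):
    """Convert '3/4', '2' or '1-1/2' into an exact Fraction."""
    whole, _, frac = token.rpartition("-")
    return (Fraction(whole) if whole else Fraction(0)) + Fraction(frac)

def parse_dimensions(text):
    """Return the first 'A x B x C' triple in text as Fractions, or None."""
    match = TRIPLE.search(text.lower())
    if match is None:
        return None
    return tuple(parse_size(g) for g in match.groups())

def dimension_distance(a, b):
    """Sum of absolute differences between two size triples (hypothetical metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_by_dimensions(query, docs):
    """Pick the document whose size triple is nearest to the query's."""
    q = parse_dimensions(query)
    parsed = [(parse_dimensions(d), d) for d in docs]
    scored = [(dimension_distance(q, dims), d) for dims, d in parsed if dims]
    return min(scored)

if __name__ == "__main__":
    docs = [
        "smith-cooper® 33rt1 reducing pipe tee 1 x 1/2 x 3/4 in npt 150 lb malleable iron black",
        "smith-cooper® 33rt1 reducing pipe tee 3/4 x 1/2 x 3/4 in npt 150 lb malleable iron black",
        "smith-cooper® 33rt1 reducing pipe tee 1-1/2 x 3/4 x 1-1/2 in npt 150 lb malleable iron black",
    ]
    print(closest_by_dimensions("3/4x1/2x3/4 blk mi tee", docs))
```

A numeric distance like this could be used on its own when the text is otherwise identical, or combined with the TF-IDF score when descriptions also differ in wording.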
