Java: How to use TF-IDF to compute similarity of two documents?

Question

My goal is to find a similarity value between two documents (collections of words). I have already found several answers like this SO post or this SO post which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.

If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".

In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.

I can't wrap my head around this. Any conceptual help or links to Java libraries that achieve this would be highly appreciated.

Run term extraction first, and once you have the list of terms with their frequencies for both corpora, calculate the cosine similarity. — Wiktor Stribiżew
@Wiktor Stribiżew: Thanks for the suggestion. So I extract the terms of both documents into a list. And then for each of those terms, I compute the tf-idf values for each of the two documents, which gives me two vectors, from which I can compute the cosine similarity. Am I understanding this correctly? — gmazlami
Yes, basically that is how it is done. Based on the term frequency, get the vectors, TF-IDF, and calculate the cosine similarity. Also, make sure you use stemming to normalize word forms you extracted to reduce noise. — Wiktor Stribiżew

Wiktor Stribiżew · Accepted Answer · 2016-11-24T10:07:04

I suggest running terminology extraction first to get the terms together with their frequencies. Note that stemming can also be applied to the extracted terms to avoid noise during the subsequent cosine similarity calculation. See the "Java library for keywords extraction from input text" SO thread for more help and ideas on that.
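As a rough illustration of the extraction step, here is a minimal sketch that counts term frequencies with a deliberately naive tokenizer. The class name `TermExtractor` and the method `termFrequencies` are hypothetical, and the tokenization rule (lowercase, split on anything that is not a letter) is an assumption. No stemming is performed here; in practice you would probably pass each token through a stemmer from an NLP library before counting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
class TermExtractor {

    // Returns term -> raw count for one document.
    // Assumption: a "term" is a maximal run of Unicode letters, lowercased.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {          // split may yield a leading empty string
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `TermExtractor.termFrequencies(...)` once per document gives the two frequency maps that the later steps work on.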

Then, as you yourself mention, for each of those terms, you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.
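The cosine step can be sketched as follows, assuming each document has already been turned into a sparse vector stored as a map from term to TF-IDF weight. The class name `CosineSimilarity` and its methods are hypothetical. Terms missing from a map are treated as having weight 0, and returning 0 for an all-zero vector is a convention chosen for this sketch.

```java
import java.util.Map;

// Hypothetical helper, not part of any library.
class CosineSimilarity {

    // cos(a, b) = (a . b) / (|a| * |b|), with absent terms counted as weight 0.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {             // only shared terms contribute
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;                      // convention for empty vectors
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (vectors pointing in the same direction).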

When calculating TF-IDF, note that the formula 1 + log(N/n) (N standing for the total number of documents (corpora) and n standing for the number of those that include the term) is better, since it avoids the case where TF is not 0 but IDF turns out equal to 0. With plain log(N/n), that happens whenever a term occurs in every document (n = N).
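To see why this matters for the two-document case, consider a small sketch comparing both IDF variants. The class name `Idf` and its methods are hypothetical, and the natural logarithm is an assumption, since the base of the log is not specified above. If the collection consists only of the two documents being compared, every shared term has n = N = 2, so plain log(N/n) gives those terms a weight of 0. Because only shared terms contribute to the dot product, the cosine similarity would then always come out as 0.

```java
// Hypothetical helper, not part of any library.
class Idf {

    // Plain IDF: log(N / n). Equals 0 when the term occurs in every document.
    static double plainIdf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    // Variant from the answer: 1 + log(N / n). Never drops below 1.
    static double smoothedIdf(int totalDocs, int docsWithTerm) {
        return 1.0 + Math.log((double) totalDocs / docsWithTerm);
    }

    // TF-IDF weight of a term, using the raw count as TF (an assumption).
    static double tfIdf(int termCount, int totalDocs, int docsWithTerm) {
        return termCount * smoothedIdf(totalDocs, docsWithTerm);
    }
}
```

For a term that appears 3 times in one document and occurs in both of two documents, `Idf.plainIdf(2, 2)` is 0 while `Idf.tfIdf(3, 2, 2)` is 3, so the shared term still counts toward the similarity.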
