
Say you're trying to find the document in a corpus that is most similar to a given search query. I've seen some examples create TF-IDF vectors whose length is the number of terms in the query, and others create TF-IDF vectors that use every term in the corpus.

Is one of these ways the "correct" way to do it?

Are you asking about the difference between a sparse array and a dense array? – maxymoo
I don't believe so. For example, if you're searching for "funny cats", I've seen some examples create TF-IDF vectors to compare similarity that are of length 2 (the TF-IDF values of "funny" and of "cats", if they appear in the document). Other examples create a TF-IDF vector over every word in the corpus for each document that contains either "funny" or "cats", so the length of the vector would be the total number of words in the corpus. It's definitely possible I'm misunderstanding something. Thanks. – Tim S
So suppose your corpus is ["funny dogs", "boring cats", "funny bunnies"], with the vocabulary ["funny", "boring", "dogs", "cats", "bunnies"]; the TF-IDF matrix of the documents is [[0.5,0,1,0,0],[0,0.5,0,1,0],[0.5,0,0,0,1]]. If you tell me that "funny bunnies" has the vector [0.5,1], how do I know which words in my vocabulary those values correspond to? – maxymoo
Well, I'm thinking that there are database tables words (id, word), documents (id, contents), and document_words (id, word_id, document_id). So if someone had a search query of "funny bunnies", what would the TF-IDF vector used with cosine similarity look like for the search query (and similarly for the document vectors): [0.5,1] or [0.5,0,0,0,1]? – Tim S

1 Answer


Suppose your corpus is ["funny dogs", "boring cats", "funny bunnies"], with the vocabulary ["funny", "boring", "dogs", "cats", "bunnies"]. The TF-IDF matrix of the documents is [[0.5,0,1,0,0],[0,0.5,0,1,0],[0.5,0,0,0,1]].
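(As an aside: in practice you'd let a library build this matrix rather than computing weights by hand. A minimal sketch using scikit-learn's TfidfVectorizer; its default IDF smoothing and L2 normalization produce different weights than the toy numbers above, but the approach is the same.)

# Sketch: build TF-IDF vectors over the full corpus vocabulary with scikit-learn.
# The exact weights differ from the toy numbers above because of scikit-learn's
# default IDF smoothing and L2 normalization, but the ranking idea is identical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["funny dogs", "boring cats", "funny bunnies"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)     # one row per document, one column per vocabulary word
query_vec = vectorizer.transform(["funny cats"])  # mapped onto the same vocabulary

print(cosine_similarity(query_vec, doc_matrix))   # highest score = best match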

You have two ways of representing a new document (your search query):

1: Dense vector.

The dense TF-IDF vector of "funny cats" is [0.5,0,0,1,0]. The cosine similarity with each of the documents (I'm actually just taking the dot product; I've left out the denominator since every document vector has the same norm) is

cos("funny cats", "funny dogs") ~ 0.5*0.5+0*0+0*1+1*0+0*0 = 0.25
cos("funny cats", "boring cats") ~ 0.5*0+0*0.5+0*0+1*1+0*0 = 1
cos("funny cats", "funny bunnies") ~ 0.5*0.5+0*0+0*0+1*0+0*1 = 0.25

So the closest match is "boring cats", because "cats" is a rarer and presumably more informative word than "funny".
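Here's the same dense calculation as runnable NumPy, hardcoding the toy matrix from above (a sketch of the scoring step, not a full TF-IDF implementation):

import numpy as np

# Rows are documents; columns follow the vocabulary
# ["funny", "boring", "dogs", "cats", "bunnies"].
docs = np.array([
    [0.5, 0.0, 1.0, 0.0, 0.0],  # "funny dogs"
    [0.0, 0.5, 0.0, 1.0, 0.0],  # "boring cats"
    [0.5, 0.0, 0.0, 0.0, 1.0],  # "funny bunnies"
])
query = np.array([0.5, 0.0, 0.0, 1.0, 0.0])  # dense vector of "funny cats"

# Dot product against every document; since all the document vectors here
# have the same norm, the ranking matches full cosine similarity.
print(docs @ query)  # [0.25 1.   0.25] -> "boring cats" is the best match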

2: Sparse vector.

The sparse TF-IDF vector of "funny cats" is [[0,0.5],[3,1]], where each pair is a vocabulary index and its TF-IDF value ("funny" is at index 0, "cats" at index 3). The calculation is

cos("funny cats", "funny dogs") ~ 0.5*0.5+1+1 = 0.25
cos("funny cats", "boring cats") ~ 0.5+1*1 = 1
cos("funny cats", "funny bunnies") ~ 0.5*0.5+1*0 = 0.25

Basically, you're doing fewer operations because you only look at the nonzero values. This may or may not be important depending on how many words are in your vocabulary.
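A sketch of the sparse version in plain Python, where the query is just a list of (vocabulary index, value) pairs; sparse_dot is a hypothetical helper here, not a library function:

# Sparse query for "funny cats": only the nonzero (index, value) pairs.
query_sparse = [(0, 0.5), (3, 1.0)]  # "funny" -> column 0, "cats" -> column 3

docs = {
    "funny dogs":    [0.5, 0.0, 1.0, 0.0, 0.0],
    "boring cats":   [0.0, 0.5, 0.0, 1.0, 0.0],
    "funny bunnies": [0.5, 0.0, 0.0, 0.0, 1.0],
}

def sparse_dot(sparse_query, dense_doc):
    # Touch only the query's nonzero positions instead of the whole vocabulary.
    return sum(value * dense_doc[index] for index, value in sparse_query)

for name, vec in docs.items():
    print(name, sparse_dot(query_sparse, vec))
# funny dogs 0.25, boring cats 1.0, funny bunnies 0.25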