Suppose your corpus is ["funny dogs", "boring cats", "funny bunnies"], with the vocabulary ["funny", "boring", "dogs", "cats", "bunnies"]. The TF-IDF matrix of the documents is [[0.5,0,1,0,0],[0,0.5,0,1,0],[0.5,0,0,0,1]] (one row per document, one column per vocabulary word; the numbers are illustrative rather than the output of one exact formula).
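For reference, a library will compute such a matrix for you. Here's a minimal sketch with scikit-learn (assuming it's installed; the exact weights will differ from the illustrative numbers above, because scikit-learn uses a log-scaled IDF and L2-normalizes each row):

```python
# Sketch: computing a real TF-IDF matrix with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["funny dogs", "boring cats", "funny bunnies"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: 3 docs x 5 words

print(vectorizer.get_feature_names_out())  # vocabulary (alphabetical order)
print(tfidf.toarray())                     # dense view of the same matrix
```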
You have two ways of representing a new document (your search query):
1: Dense vector.
The dense TF-IDF vector of "funny cats" is [0.5,0,0,1,0]. The cosine similarity with each of the documents (I'm actually just doing the dot product; I've left out the denominator, since every document vector has the same norm) is
cos("funny cats", "funny dogs") ~ 0.5*0.5+0*0+0*1+1*0+0*0 = 0.25
cos("funny cats", "boring cats") ~ 0.5*0+0*0.5+0*0+1*1+0*0 = 1
cos("funny cats", "funny bunnies") ~ 0.5*0.5+0*0+0*0+1*0+0*1 = 0.25
So the closest match is "boring cats", because "cats" is a rarer and presumably more informative word than "funny".
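A minimal sketch of this dense scoring in plain Python (the variable names are mine):

```python
# Dense scoring: every vector stores all five positions, zeros included,
# so each dot product touches all five vocabulary slots.
query = [0.5, 0, 0, 1, 0]                      # "funny cats"
docs = {
    "funny dogs":    [0.5, 0, 1, 0, 0],
    "boring cats":   [0, 0.5, 0, 1, 0],
    "funny bunnies": [0.5, 0, 0, 0, 1],
}

for name, vec in docs.items():
    score = sum(q * d for q, d in zip(query, vec))
    print(name, score)                         # 0.25, 1.0, 0.25 as above
```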
2: Sparse vector.
The sparse TF-IDF vector of "funny cats" is [[0,0.5], [3,1]], where each pair is [vocabulary index, weight]. The calculation is
cos("funny cats", "funny dogs") ~ 0.5*0.5 + 1*0 = 0.25
cos("funny cats", "boring cats") ~ 0.5*0 + 1*1 = 1
cos("funny cats", "funny bunnies") ~ 0.5*0.5+1*0 = 0.25
Basically you're doing fewer operations, because you only look at the nonzero values. This may or may not be important, depending on how many words are in your vocabulary.
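A minimal sketch of the sparse version (same toy data; representing each sparse vector as an index-to-weight dict is my own choice):

```python
# Sparse scoring: each vector stores only its nonzero (index, weight)
# entries, so a dot product only touches indices present in the query.
query = {0: 0.5, 3: 1}                         # "funny cats"
docs = {
    "funny dogs":    {0: 0.5, 2: 1},
    "boring cats":   {1: 0.5, 3: 1},
    "funny bunnies": {0: 0.5, 4: 1},
}

for name, vec in docs.items():
    # Only indices nonzero in both vectors contribute.
    score = sum(w * vec[i] for i, w in query.items() if i in vec)
    print(name, score)                         # 0.25, 1.0, 0.25 as above
```

With a five-word vocabulary the saving is negligible; with a realistic vocabulary of tens of thousands of words, the dense loop would spend almost all of its time multiplying zeros.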
Using the same corpus, vocabulary, and TF-IDF matrix as above: if you tell me that "funny bunnies" has the vector [0.5,1], how do I know what words in my vocabulary those values correspond to? – maxymoo
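(A small sketch of the point being made here: the index half of each sparse pair is what lets you recover the words, which a bare value list like [0.5,1] cannot do.)

```python
# The index in each (index, weight) pair recovers the word;
# a bare list of values like [0.5, 1] loses that information.
vocab = ["funny", "boring", "dogs", "cats", "bunnies"]
sparse = [(0, 0.5), (4, 1)]   # "funny bunnies"

for i, weight in sparse:
    print(vocab[i], weight)   # funny 0.5 / bunnies 1
```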
Say I have tables words(id, word), documents(id, contents), and document_words(id, word_id, document_id). So if someone had a search query of "funny bunnies", what would the TF-IDF vector used with cosine similarity look like for the search query (and similarly for the document vectors)? [0.5,1], or [0.5,0,0,0,1]? – Tim S