So I'm struggling with an information retrieval concept: the cosine similarity of documents given a query.
I am processing about 1000 files to generate a term frequency matrix of shape [docID x terms].
I have generated this matrix, but I'm stumped on what to do with the query and how to compute cosine similarity from it.
I am given a query whose terms I am supposed to look up in the corpus, which I have done. From that I generated a vector of all the docIDs that contain at least one of the query words.
So am I supposed to compare all of these row vectors using cosine similarity?
Example:
Query is a list of pairs: each term's column index in the term frequency matrix, followed by the term itself.
OccurrenceVector is an array of the docIDs of all documents that contain at least one of the words in the query.
Query = [[2796, 'crystalline'], [6714, 'lens'], [5921, 'including'], [5566, 'humans']]
OccurrenceVector = array([ 13, 14, 15, 72, 79, 138, 142, 164, 165, 166, 167, 168, 169,
170, 171, 172, 180, 181, 182, 183, 184, 185, 186, 211, 212, 213,
499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511,
512, 513])
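For reference, here is a minimal sketch of how an array like OccurrenceVector can be derived. It assumes the term frequency matrix is a NumPy array (called tf here, a hypothetical name), that docIDs are simply row indices, and it uses a made-up toy matrix rather than my real data:

```python
import numpy as np

def docs_containing_any(tf, term_columns):
    """Return the row indices (docIDs) with a nonzero count in at least one query column."""
    sub = tf[:, term_columns]            # keep only the query-term columns
    return np.flatnonzero((sub > 0).any(axis=1))

# Toy term frequency matrix: 4 documents x 4 terms (illustrative values only)
tf = np.array([
    [2, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
query_columns = [0, 1]                   # column indices of the query terms
occurrence = docs_containing_any(tf, query_columns)
print(occurrence)                        # [0 2]
```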
My thought process is like this:
1. Term frequency matrix of [docID x terms] (row x column)
2. Receive a query with terms against the corpus
3. Retrieve a vector with all the docIDs these terms occur in
4. Retrieve the row corresponding to each retrieved docID
5. Compute the cosine similarity between all rows retrieved?
Is this the correct way of thinking about computing cosine similarity with a multidimensional array like this?
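To make the question concrete, here is a rough sketch of the other reading I am unsure about: turning the query itself into a vector over the same term columns and scoring each retrieved row against it with cos(q, d) = (q . d) / (|q| |d|), instead of comparing the retrieved rows with each other. The function name, the variable names, the toy matrix, and the choice of weight 1 for every query term are all assumptions made for illustration:

```python
import numpy as np

def cosine_scores(tf, query_columns, doc_ids):
    """Cosine similarity between a binary query vector and each selected document row."""
    tf = np.asarray(tf, dtype=float)
    q = np.zeros(tf.shape[1])
    q[query_columns] = 1.0               # assumption: every query term has weight 1
    rows = tf[doc_ids]                   # one row per candidate document
    dots = rows @ q                      # q . d for every candidate
    norms = np.linalg.norm(rows, axis=1) * np.linalg.norm(q)
    # Guard against all-zero rows to avoid dividing by zero
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Toy term frequency matrix: 4 documents x 4 terms (illustrative values only)
tf = np.array([
    [2, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
query_columns = [0, 1]
doc_ids = np.array([0, 2])               # documents containing at least one query term

scores = cosine_scores(tf, query_columns, doc_ids)
ranking = doc_ids[np.argsort(-scores)]   # docIDs ordered from most to least similar
print(scores)
print(ranking)
```

With this reading, each retrieved document gets one score against the query, and sorting those scores gives a ranking of the docIDs.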