I'm struggling with an information retrieval concept: computing the cosine similarity of documents given a query.

I am processing about 1000 files to build a term frequency matrix of shape [docID x terms].

I have generated this matrix, but I'm stumped on what to do with the query and how to compute cosine similarity from it.

I am given a query whose terms I am supposed to look up in the corpus, which I have done. From that I generated a vector of all the docIDs that contain at least one of the query words.

So am I supposed to compute the cosine similarity between all of these row vectors?

Example:

The Query is a list of [column index, term] pairs, where the column index is the term's location in the term frequency matrix.

The OccurrenceVector is an array of the docIDs of all documents that include at least one of the words in the query.

Query = [[2796, 'crystalline'], [6714, 'lens'], [5921, 'including'], [5566, 'humans']]
OccurrenceVector = array([ 13,  14,  15,  72,  79, 138, 142, 164, 165, 166, 167, 168, 169,
   170, 171, 172, 180, 181, 182, 183, 184, 185, 186, 211, 212, 213,
   499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511,
   512, 513])

My thought process is like this:

  1. Term Frequency matrix of [docID x terms] (row x column)

  2. Receive a query with terms against the corpus

  3. Retrieve a vector with all the docIDs these terms occur in

  4. Retrieve each row respective to that retrieved docID

  5. Compute the cosine similarity between all rows retrieved?

Is this the correct way of thinking about computing cosine similarity with a multidimensional array like this?

1 Answer

I suggest you have a look at Chapter 6 of the IR Book (especially Section 6.3).

You need to treat the query as a document, as well. Construct a vector for your query as you construct it for your documents. Then in order to get the best hits, you need to compute similarity against all the document vectors for your query.
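As a rough illustration of that idea, here is a minimal NumPy sketch. It assumes the term frequency matrix is a dense array of shape [docID x terms] and that the query arrives as [column index, term] pairs like in the question. The function name `cosine_scores` and the toy counts are made up for the example, not taken from your code.

```python
import numpy as np

def cosine_scores(tf, query_vec):
    """Cosine similarity between one query vector and every row of tf."""
    tf = np.asarray(tf, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    dots = tf @ query_vec
    norms = np.linalg.norm(tf, axis=1) * np.linalg.norm(query_vec)
    scores = np.zeros(tf.shape[0])
    # Rows (or a query) with zero length keep a score of 0 instead of dividing by 0.
    np.divide(dots, norms, out=scores, where=norms > 0)
    return scores

# Toy term frequency matrix: 4 documents x 5 terms (hypothetical counts).
tf = np.array([
    [2, 0, 1, 0, 0],
    [0, 3, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 4, 0],
])

# Query in the same [column index, term] form as in the question.
query = [[0, 'crystalline'], [2, 'lens']]

# Build the query vector in the same term space as the document rows.
query_vec = np.zeros(tf.shape[1])
for col, _term in query:
    query_vec[col] += 1

scores = cosine_scores(tf, query_vec)
ranking = np.argsort(-scores, kind='stable')  # docIDs, best match first
```

Documents that share no term with the query get a score of 0 anyway, so restricting the computation to the rows in your OccurrenceVector is only an optimization; it does not change the ranking of the matching documents.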

Remember that you can also pick a document vector and compute its similarity with all the other documents in your corpus. This way you can compute the similarity between your documents.
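A small sketch of that document-to-document case, again assuming a dense [docID x terms] NumPy array; `pairwise_cosine` and the toy matrix are hypothetical names and data used only for illustration.

```python
import numpy as np

def pairwise_cosine(tf):
    """Cosine similarity between every pair of rows of tf."""
    tf = np.asarray(tf, dtype=float)
    norms = np.linalg.norm(tf, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero vectors
    unit = tf / norms        # each non-zero row now has length 1
    return unit @ unit.T     # entry [i, j] is cos(doc i, doc j)

# Toy term frequency matrix: 4 documents x 5 terms (hypothetical counts).
tf = np.array([
    [2, 0, 1, 0, 0],
    [0, 3, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 4, 0],
])

sim = pairwise_cosine(tf)

# Documents ordered by similarity to document 0 (document 0 itself comes first).
doc_id = 0
most_similar = np.argsort(-sim[doc_id], kind='stable')
```

Note that this answers a different question from the query ranking: it tells you which documents resemble each other, not which documents match the query.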

Hope this helps.

Cheers