4
votes

I have a set of files and a query document. My goal is to return the documents most similar to the query by comparing the query with each document. To use cosine similarity, I first have to map the document strings to vectors. I have already written a tf-idf function that computes the weights for each document.

To get the index of each string, I have a function like this:

def getvectorKeywordIndex(self, documentList):
    """Map each keyword to its position (dimension) in the document vectors."""
    # Join all documents into a single string of words
    vocabularyString = " ".join(documentList)
    # split() with no argument also drops empty strings caused by repeated spaces
    vocabularylist = vocabularyString.split()
    vocabularylist = list(set(vocabularylist))
    print('vocabularylist', vocabularylist)
    vectorIndex = {}
    offset = 0
    # Associate each keyword with a position, i.e. the vector dimension that represents this word
    for word in vocabularylist:
        vectorIndex[word] = offset
        offset += 1
    print(vectorIndex)
    return vectorIndex, vocabularylist  # (keyword: position), vocabularylist

My cosine similarity function is:

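# Requires: import math, numpy
def cosine_distance(self, index, queryDoc):
    # Despite the name, this returns the cosine similarity, not a distance
    vector1 = self.makeVector(index)
    vector2 = self.makeVector(queryDoc)

    # Dot product divided by the product of the two vector norms
    return numpy.dot(vector1, vector2) / (math.sqrt(numpy.dot(vector1, vector1)) * math.sqrt(numpy.dot(vector2, vector2)))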

The TF-IDF function is:

def tfidf(self, term, key):

    return (self.tf(term,key) * self.idf(term))

My problem is this: how can I write makeVector so that it uses the index, the vocabulary list, and the tf-idf function? Any answer is welcome.


1 Answer

2
votes

You should pass vectorIndex to makeVector as well and use it to look up the indices for terms in documents and queries. Ignore terms that do not appear in vectorIndex.
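
As a rough illustration of that suggestion, here is a minimal Python 3 sketch, not the asker's actual class. The toy corpus, the standalone `tfidf` helper, and the names `make_vector` and `cosine_similarity` are assumptions made for the example; the asker's own `self.tfidf` would take the helper's place. The zero-norm guard is also an addition that the original formula does not have.

```python
import math

# Toy corpus; in the question these strings would come from the files.
documents = ["the cat sat", "the dog barked", "the cat purred"]

# Same idea as getvectorKeywordIndex: one vector position per distinct word.
# Sorting is only there to make the positions reproducible.
vocabulary = sorted(set(" ".join(documents).split()))
vector_index = {word: position for position, word in enumerate(vocabulary)}


def tfidf(term, text):
    """Hypothetical stand-in for the asker's tf-idf method.

    Only called for terms in vector_index, so `containing` is never zero.
    """
    words = text.split()
    tf = words.count(term) / len(words)
    containing = sum(1 for doc in documents if term in doc.split())
    return tf * math.log(len(documents) / containing)


def make_vector(text, vector_index):
    """Build a dense vector with one tf-idf weight per vocabulary term."""
    vector = [0.0] * len(vector_index)
    for term in set(text.split()):
        position = vector_index.get(term)
        if position is None:
            continue  # term is not in the vocabulary: ignore it
        vector[position] = tfidf(term, text)
    return vector


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a vector with no known terms matches nothing
    return dot / (norm1 * norm2)


# "meowed" is not in the vocabulary, so it is skipped.
query_vector = make_vector("cat meowed", vector_index)
scores = [cosine_similarity(make_vector(doc, vector_index), query_vector)
          for doc in documents]
```

Sorting the documents by `scores` in descending order then gives the ranking. Both the query and the documents go through the same `vector_index`, which is what makes their vectors comparable.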

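Mind you, when dealing with documents you should really be using scipy.sparse matrices instead of dense NumPy arrays, or you'll quickly run out of memory: each document contains only a small fraction of the vocabulary, so most entries of a dense vector are zero.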
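
To make the memory argument concrete, the sketch below stores only the non-zero entries of each vector. It uses a plain Python dict as a stand-in so the idea is visible without extra dependencies; it is not scipy.sparse itself, which would normally hold all documents in a single matrix. The names `make_sparse_vector`, `sparse_cosine`, and `term_count` are hypothetical, and raw term counts are used as a placeholder for tf-idf weights.

```python
import math


def make_sparse_vector(text, vector_index, weight):
    """Return {position: weight}, keeping only the non-zero entries."""
    sparse = {}
    for term in set(text.split()):
        position = vector_index.get(term)
        if position is None:
            continue  # unknown term: ignore it
        value = weight(term, text)
        if value != 0:
            sparse[position] = value
    return sparse


def sparse_cosine(v1, v2):
    """Cosine similarity of two {position: weight} dicts."""
    # Iterate over the smaller dict; only shared positions add to the dot product.
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    dot = sum(value * v2.get(position, 0.0) for position, value in v1.items())
    norm1 = math.sqrt(sum(value * value for value in v1.values()))
    norm2 = math.sqrt(sum(value * value for value in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def term_count(term, text):
    """Placeholder weight (raw count); a tf-idf function would go here."""
    return text.split().count(term)


# Hypothetical vocabulary index, as getvectorKeywordIndex would produce.
vector_index = {"cat": 0, "dog": 1, "sat": 2, "barked": 3, "the": 4}

doc_vector = make_sparse_vector("the cat sat", vector_index, term_count)
query_vector = make_sparse_vector("cat meowed", vector_index, term_count)
similarity = sparse_cosine(doc_vector, query_vector)
```

The storage cost of each vector now depends on the number of distinct terms in that document rather than on the size of the whole vocabulary, which is the property scipy.sparse matrices provide at scale.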

(Alternatively, consider using the Vectorizer in scikit-learn which handles all of this for you, uses scipy.sparse matrices and computes tf-idf values. Disclaimer: I wrote parts of that class.)