40
votes

I am trying to build a document retrieval model that returns the most relevant documents, ordered by their relevance to a query or search string. For this I trained a doc2vec model using the Doc2Vec class in gensim. My dataset is a pandas DataFrame in which each document is stored as a string on its own row. This is the code I have so far:

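import multiprocessing  # needed for cpu_count() below
import re

import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence  # replaced by TaggedDocument in newer gensim releases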

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)
sentences= []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(LabeledSentence(line,[item_no]))

# MODEL PARAMETERS
dm = 1 # 1 for distributed memory(default); 0 for dbow 
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(
    documents=sentences,
    dm=dm,
    alpha=alpha,  # initial learning rate
    seed=seed,
    min_count=min_count,  # ignore words with frequency lower than min_count
    max_vocab_size=None,  # no limit on vocabulary size
    window=context_window,  # number of words before and after to use as context
    size=size,  # dimensionality of the feature vectors
    sample=1e-4,  # threshold for downsampling high-frequency words
    negative=5,  # number of noise words drawn for negative sampling
    workers=cores,  # number of worker threads
    iter=max_iter,  # number of iterations (epochs) over the corpus
)

# QUERY BASED DOC RANKING ??

The part where I am struggling is finding the documents that are most similar/relevant to the query. I used infer_vector, but then realised that it treats the query as a document, updates the model and returns the results. I tried the most_similar and most_similar_cosmul methods, but they return words along with a similarity score (I guess). What I want is this: when I enter a search string (a query), I should get back the ids of the most relevant documents along with a similarity score (cosine etc.). How do I get this part done?

1
Does your query exist in the dataset? If so, you can use the sentence_tag to find similar sentences. If not, you could create an inferred vector (after gensim 0.12.4) and query with it. Both use model.docvecs.most_similar(). - umutto
@umutto My query is a string, for example "customer segmentation". Both "customer" and "segmentation" exist in the vocabulary. By sentence_tag you mean the tag we pass in LabeledSentence, right? If so, I have used the document id (basically a number 1, 2, 3, ..., num_docs) as the tag. I used infer_vector, but that wasn't helpful because it treats the query as a document, updates the model weights and returns similar documents. I don't want to update the model every time I pass a query. Lastly, model.docvecs.most_similar() can be used, but it needs a vector to find the most similar docs. - Clock Slave
@umutto So basically the question comes down to: how do I get a vector representation of the query without altering the model? - Clock Slave
The infer method will ignore any words it does not have in its vocab and should not update weights, afaik. Passing the inferred vector to the most_similar function should indeed give you back the tags of similar docs. Have you tried that? What happens? Have you saved and loaded the model again? - Luke Barker
@ClockSlave Currently I don't think there is any other way to get the vector representations. If you have a query that exists in your vocabulary, then you can use its tag (the document id in your case) to calculate similarity or to get its vector. But I don't think infer_vector would update the weights. You may see somewhat different results from the same query due to the non-deterministic nature of some of the algorithms used (negative sampling, dbow=1 etc.). But that does not mean the model is altered. - umutto

1 Answer

52
votes

You need to use infer_vector to get a document vector for the new text. This does not alter the underlying model.

Here is how you do it:

tokens = "a new sentence to match".split()

new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector]) #gives you top 10 document tags and their cosine similarity
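To make the ranking step less of a black box, here is a minimal sketch of ordering documents by cosine similarity to a query vector, written in plain Python. It is not gensim's implementation; the names `cosine`, `rank_documents`, `toy_doc_vecs` and `toy_query_vec` are hypothetical, the vectors are made-up 3-dimensional values, and all vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs, topn=10):
    """Return (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical document vectors keyed by document id (the tag).
toy_doc_vecs = {
    0: [1.0, 0.0, 0.0],
    1: [0.7, 0.7, 0.0],
    2: [0.0, 0.0, 1.0],
}
# Hypothetical stand-in for the vector returned by infer_vector.
toy_query_vec = [0.9, 0.1, 0.0]

ranking = rank_documents(toy_query_vec, toy_doc_vecs, topn=2)
print(ranking)
```

With a trained model, the inferred query vector would take the place of `toy_query_vec` and the model's stored document vectors would take the place of `toy_doc_vecs`.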

Edit:

Here is an example showing that the underlying model does not change after infer_vector is called.

import numpy as np

words = "king queen man".split()

len_before =  len(model.docvecs) #number of docs

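# word vectors for king, queen, man
# (copied, because indexing returns a view into the model's weights,
# which would otherwise always compare equal to itself)
w_vec0 = model[words[0]].copy()
w_vec1 = model[words[1]].copy()
w_vec2 = model[words[2]].copy()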

new_vec = model.infer_vector(words)

len_after =  len(model.docvecs)

print(np.array_equal(model[words[0]], w_vec0)) # True
print(np.array_equal(model[words[1]], w_vec1)) # True
print(np.array_equal(model[words[2]], w_vec2)) # True

print(len_before == len_after) # True
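One practical detail when applying this to the setup in the question: the query should probably be tokenized the same way as the training documents, since a plain `.split()` can produce tokens that the model never saw. The sketch below reuses the question's regex tokenizer and separates known from unknown words. The helper `prepare_query` and the set `toy_vocab` are hypothetical; with a real model, the vocabulary would come from the trained model.

```python
import re

def tokenizer(input_string):
    # Same regex as the tokenizer applied to the training documents in the question.
    return re.findall(r"[\w']+", input_string)

def prepare_query(query, known_words):
    """Tokenize a query and split its tokens into known and unknown words."""
    tokens = tokenizer(query)
    known = [t for t in tokens if t in known_words]
    unknown = [t for t in tokens if t not in known_words]
    return known, unknown

# Hypothetical vocabulary standing in for the trained model's vocabulary.
toy_vocab = {"customer", "segmentation", "analysis"}

query = "customer-segmentation, churn"
known, unknown = prepare_query(query, toy_vocab)

print(known)          # tokens the model could use
print(unknown)        # tokens outside the vocabulary
print(query.split())  # whitespace splitting keeps punctuation attached
```

The `known` list is what would be passed to infer_vector. Inspecting `unknown` is a quick way to notice when a query contains little that the model was trained on.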