I am trying to build a document retrieval model that returns the most relevant documents, ordered by their relevance to a query or search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is a pandas DataFrame in which each document is stored as a string on its own row. This is the code I have so far:
import multiprocessing
import re

import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence
# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)
# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)
sentences = []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(LabeledSentence(line, [item_no]))
# MODEL PARAMETERS
dm = 1 # 1 for distributed memory(default); 0 for dbow
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200
# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(
    documents=sentences,
    dm=dm,
    alpha=alpha,            # initial learning rate
    seed=seed,
    min_count=min_count,    # ignore words with freq less than min_count
    max_vocab_size=None,    # no limit on vocabulary size
    window=context_window,  # the number of words before and after to be used as context
    size=size,              # dimensionality of the feature vector
    sample=1e-4,            # threshold for downsampling high-frequency words
    negative=5,             # number of noise words drawn per negative-sampling step
    workers=cores,          # number of cores
    iter=max_iter,          # number of iterations (epochs) over the corpus
)
# QUERY BASED DOC RANKING ??
The part where I am struggling is finding the documents that are most similar/relevant to the query. I used the infer_vector method, but then realised that it considers the query as a document, updates the model and returns the results. I tried the most_similar and most_similar_cosmul methods, but in return I get words along with what I guess is a similarity score. What I want is this: when I enter a search string (a query), I should get the ids of the most relevant documents along with a similarity score (cosine etc.). How do I get this part done?
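To make the desired output concrete, here is a minimal sketch of the ranking step on its own, assuming the query vector and the document vectors are already available as plain lists of floats. The names cosine, rank_documents and doc_vecs are made up for illustration and are not part of gensim; the toy vectors are invented too.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs, topn=10):
    """Return (doc_id, similarity) pairs, most similar first.

    doc_vecs is a hypothetical dict mapping a document id (the tag)
    to that document's vector.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors standing in for real document vectors
doc_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
ranking = rank_documents([1.0, 0.1], doc_vecs, topn=2)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

The result is a list of (document id, cosine similarity) pairs, which is the shape of output the question asks for.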
model.docvecs.most_similar() – umutto
sentence_tag, you mean the tag we pass in LabeledSentence, right? If so, then I have used the document id (basically a number 1, 2, 3, ..., num_docs) as the tag. I used infer_vector, but that wasn't helpful because it considers the query as the document, updates the model weights and returns similar documents. I don't want to update the model every time I pass a query. Lastly, model.docvecs.most_similar() can be used, but it needs a vector to find the most similar docs – Clock Slave
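A hedged sketch of how the two calls mentioned in these comments might be combined: infer a vector for the tokenized query, then pass that vector to model.docvecs.most_similar(). As far as I know, infer_vector only fits a new vector for the text it is given and leaves the trained weights untouched, but that is worth checking against the gensim version in use. The wrapper name rank_by_query is hypothetical, and the docvecs attribute follows the code in the question (newer gensim releases expose it as dv).

```python
import re


def tokenizer(input_string):
    # Same tokenizer as in the question, so the query is split like the training documents
    return re.findall(r"[\w']+", input_string)


def rank_by_query(model, query, topn=10):
    """Return (tag, similarity) pairs for the documents closest to the query.

    model is assumed to be a trained gensim Doc2Vec model whose tags are
    the document ids, as described in the comment above.
    """
    tokens = tokenizer(query)
    query_vec = model.infer_vector(tokens)
    # Passing a raw vector in `positive` compares it against the document vectors
    return model.docvecs.most_similar(positive=[query_vec], topn=topn)
```

Usage would look like rank_by_query(model, "some search string", topn=5). Because the inferred vector depends on random initialisation, repeated calls with the same query can return slightly different scores.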