I have a sample of ~60,000 documents. We've hand-coded 700 of them as having a certain type of content. Now we'd like to find the documents "most similar" to those 700. We're using gensim's doc2vec, and I can't quite figure out the best way to do this.
Here's what my code looks like:
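import multiprocessing
import random
from gensim.models.doc2vec import Doc2Vec

cores = multiprocessing.cpu_count()
# train_lbls removed: it appears to come from a much older gensim API
model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0,
                epochs=10, workers=cores, dbow_words=1)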
all_docs = load_all_files() # this function returns a named tuple
random.shuffle(all_docs)
print("Docs loaded!")
model.build_vocab(all_docs)
# reuse the epoch count set on the model instead of passing a second, different value
model.train(all_docs, total_examples=model.corpus_count, epochs=model.epochs)
I can't figure out the right way to go forward. Is this something that doc2vec can do? In the end, I'd like a ranked list of the 60,000 documents, ordered by similarity to the 700 hand-coded ones, with the most similar document first.
Thanks for any help you might have! I've spent a lot of time reading the gensim help documents and the various tutorials floating around and haven't been able to figure it out.
EDIT: I can use this code to get the documents most similar to a short sentence:
token = "words associated with my research questions".split()
new_vector = model.infer_vector(token)
sims = model.docvecs.most_similar([new_vector])
for x in sims:
    print(' '.join(all_docs[x[0]][0]))
If there's a way to modify this to instead get the documents most similar to the 700 coded documents, I'd love to learn how to do it!
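For what it's worth, here is a rough sketch of one way the ranking could work: average the vectors of the hand-coded documents into a single "target" vector, then sort every other document by cosine similarity to it. This is only an illustration under assumptions, not a gensim API. The names `rank_by_similarity`, `doc_vectors`, and `coded_tags` are made up for the example, and the toy 2-D vectors stand in for the real 100-dimensional ones, which would presumably be read out of the trained model (for example via `model.docvecs[tag]`).

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(doc_vectors, coded_tags):
    """Rank uncoded documents by similarity to the centroid of the coded ones.

    doc_vectors: dict mapping a document tag to its vector (hypothetical input,
                 e.g. filled from the trained model).
    coded_tags:  tags of the hand-coded documents.
    Returns a list of (tag, score) pairs, most similar first.
    """
    coded = set(coded_tags)
    target = centroid([doc_vectors[tag] for tag in coded])
    scored = [
        (tag, cosine(vector, target))
        for tag, vector in doc_vectors.items()
        if tag not in coded  # the coded documents themselves are left out
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: two "coded" documents and three uncoded ones.
toy_vectors = {
    "coded_a": [1.0, 0.0],
    "coded_b": [1.0, 0.2],
    "near": [0.9, 0.1],
    "far": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
ranking = rank_by_similarity(toy_vectors, ["coded_a", "coded_b"])
for tag, score in ranking:
    print(tag, round(score, 3))
```

As far as I can tell, gensim's own `most_similar` also takes a `positive` list of tags or vectors plus a `topn` argument, so passing the 700 tags there with a large `topn` may give a comparable ranking without the manual averaging. One thing to watch: if the tags are integer positions assigned before `random.shuffle(all_docs)`, then `all_docs[tag]` will no longer point at the matching document.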