0
votes

I have a sample of ~60,000 documents. We've hand coded 700 of them as having a certain type of content. Now we'd like to find the "most similar" documents to the 700 we already hand-coded. We're using gensim doc2vec and I can't quite figure out the best way to do this.

Here's what my code looks like:

cores = multiprocessing.cpu_count()

model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0, 
        epochs=10, workers=cores, dbow_words=1, train_lbls=False)

all_docs = load_all_files() # this function returns a named tuple
random.shuffle(all_docs)
print("Docs loaded!")
model.build_vocab(all_docs)
model.train(all_docs, total_examples=model.corpus_count, epochs=5)

I can't figure out the right way to go forward. Is this something that doc2vec can do? In the end, I'd like to have a ranked list of the 60,000 documents, where the first one is the "most similar" document.

Thanks for any help you might have! I've spent a lot of time reading the gensim help documents and the various tutorials floating around and haven't been able to figure it out.

EDIT: I can use this code to get the documents most similar to a short sentence:

token = "words associated with my research questions".split()
new_vector = model.infer_vector(token)
sims = model.docvecs.most_similar([new_vector])
for x in sims:
    print(' '.join(all_docs[x[0]][0]))

If there's a way to modify this to instead get the documents most similar to the 700 coded documents, I'd love to learn how to do it!

3
Have you tried model.docvecs.most_similar? Would be good for you to include what you have tried from the available resources. - de1
My understanding is that most_similar() returns the documents most similar to a query, which is different from the documents most similar to my 700 "good" matches. If I'm confused, I'd appreciate you informing me! - Academic Researcher

3 Answers

0
votes

Your general approach is reasonable. A few notes about your setup:

  • you'd have to specify epochs=10 in your train() call to truly get 10 training passes – and 10 or more is most common in published work
  • sample-controlled downsampling helps speed training and often improves vector quality as well, and the value can become more aggressive (smaller) with larger datasets
  • train_lbls is not a parameter to Doc2Vec in any recent gensim version

There are several possible ways to interpret and pursue your goal of "find the 'most similar' documents to the 700 we already hand-coded". For example, for a candidate document, how should its similarity to the set-of-700 be defined - as a similarity to one summary 'centroid' vector for the full set? Or as its similarity to any one of the documents?

There are a couple ways you could obtain a single summary vector for the set:

  • average their 700 vectors together

  • combine all their words into one synthetic composite document, and infer_vector() on that document. (But note: texts fed to gensim's optimized word2vec/doc2vec routines face an internal implementation limit of 10,000 tokens – excess words are silently ignored.)
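The first option (averaging) can be illustrated with plain NumPy; the vectors and IDs below are toy stand-ins for the real doc2vec vectors and document tags, and the ranking step does by hand what most_similar() would do against the centroid:

```python
import numpy as np

# toy stand-ins for doc2vec vectors: 5 documents, 4 dimensions
doc_vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
])
ref_ids = [0, 1]  # stand-in for the 700 hand-coded documents

# one summary vector: the mean of the reference set's vectors
centroid = doc_vectors[ref_ids].mean(axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# rank every document by similarity to the centroid, most-similar first
ranked = sorted(range(len(doc_vectors)),
                key=lambda i: cosine(doc_vectors[i], centroid),
                reverse=True)
```

With real gensim vectors you'd build the stack from the model (e.g. the per-tag vectors for your 700 tags) rather than a hand-written array, but the centroid-then-rank logic is the same.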

In fact, the most_similar() method can take a list of multiple vectors as its 'positive' target, and will automatically average them together before returning its results. So if, say, the 700 document IDs (tags used during training) are in the list ref_docs, you could try...

sims = model.docvecs.most_similar(positive=ref_docs, topn=0)

...and get back a ranked list of all other in-model documents, by their similarity to the average of all those positive examples.

However, the alternate interpretation, that a document's similarity to the reference-set is its highest similarity to any one document inside the set, might be better for your purpose. This could especially be the case if the reference set itself is varied over many themes – and thus not well-summarized by a single average vector.

You'd have to compute these similarities with your own loops. For example, roughly:

sim_to_ref_set = {}
for doc_id in all_doc_ids:
    sim_to_ref_set[doc_id] = max(model.docvecs.similarity(doc_id, ref_id) for ref_id in ref_docs)
sims_ranked = sorted(sim_to_ref_set.items(), key=lambda it: it[1], reverse=True)

The top items in sims_ranked would then be those most-similar to any item in the reference set. (Assuming the reference-set ids are also in all_doc_ids, the 1st 700 results will be the chosen docs again, all with a self-similarity of 1.0.)

0
votes

n_similarity looks like the function you want, but it seems to only work with samples in the training set.

Since you have only 700 documents to crosscheck against, using sklearn shouldn't pose performance issues. Simply get the vectors of your 700 documents, use sklearn.metrics.pairwise.cosine_similarity to score them against a query vector, and then pick the closest matches (e.g. using np.argmax). Some untested code to illustrate that:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

reference_vectors = ... # the vectors of your 700 documents
new_vector = ... # inferred as per your last example
similarity_matrix = cosine_similarity([new_vector], reference_vectors)
most_similar_indices = np.argmax(similarity_matrix, axis=-1)

That can also be modified to implement a method like n_similarity for a number of unseen documents.
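Extending that idea to rank all of your documents at once: a small self-contained sketch, with toy arrays standing in for the real 60,000 document vectors and the 700 reference vectors. Each document is scored by its best match in the reference set, then the documents are sorted by that score:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy stand-ins: each row is one document's vector
all_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]])
reference_vectors = np.array([[1.0, 0.0]])  # stand-in for the 700 coded docs

# shape (n_all, n_ref): similarity of every document to every reference doc
sim_matrix = cosine_similarity(all_vectors, reference_vectors)

# score each document by its single best match in the reference set
best_scores = sim_matrix.max(axis=1)

# indices of all documents, ranked most-similar first
ranking = np.argsort(-best_scores)
```

Using `.mean(axis=1)` instead of `.max(axis=1)` would instead score each document by its average similarity to the reference set, closer to the centroid interpretation.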

0
votes

I think you can do what you want with TaggedDocument. The basic use case is to just add a unique tag (document id) for every document, but here you will want to add a special tag to all 700 of your hand-selected documents. Call it whatever you want, in this case I call it TARGET. Add that tag to only your 700 hand-tagged documents, omit it for the other 59,300.

TaggedDocument(words=gensim.utils.simple_preprocess(document), tags=['TARGET', document_id])

Now, train your Doc2Vec.

Next, you can use model.docvecs.similarity to score the similarity between your unlabeled documents and the custom tag.

model.docvecs.similarity(document_id, 'TARGET')

Then just sort those scores. I don't think n_similarity or most_similar are going to be appropriate for what you want to do.
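A tiny sketch of that sorting step, using hypothetical scores in place of the real model.docvecs.similarity(doc_id, 'TARGET') calls:

```python
# hypothetical per-document similarity scores to the 'TARGET' tag;
# in practice each value would come from model.docvecs.similarity(doc_id, 'TARGET')
scores = {'doc_17': 0.82, 'doc_03': 0.95, 'doc_42': 0.10}

# rank documents most-similar first
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The first element of ranked is then the unlabeled document most similar to the hand-coded set, as summarized by the shared tag's vector.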

60,000 documents is not very many for Doc2Vec, but maybe you will have good luck.