I am training a Doc2Vec model on tweets. Each tweet's word list is treated as a separate document and is tagged "SENT_1", "SENT_2", etc.

import gensim
from gensim.models.doc2vec import TaggedDocument

taggeddocs = []
for index, i in enumerate(cleaned_tweets):
    if len(i) > 2:  # skip empty/near-empty tweets
        sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[u'SENT_{:d}'.format(index)])
        taggeddocs.append(sentence)

# build the model
model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)

for epoch in range(200):
    if epoch % 20 == 0:
        print('Now training epoch %s' % epoch)
    model.train(taggeddocs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

I wish to find tweets similar to a given tweet, say "SENT_2". How?

I get labels for similar tweets as:

sims = model.docvecs.most_similar('SENT_2')
for label, score in sims:
    print(label)

It prints as:

SENT_4372
SENT_1143
SENT_4024
SENT_4759
SENT_3497
SENT_5749
SENT_3189
SENT_1581
SENT_5127
SENT_3798

But given a label, how do I get the original tweet's words/sentence? E.g. what are the tweet words of, say, "SENT_3497"? Can I query the Doc2Vec model for this?

1 Answer

Gensim's Word2Vec/Doc2Vec models don't store the corpus data – they only examine it, in multiple passes, to train up the model. If you need to retrieve the original texts, you should populate your own lookup-by-key data structure, such as a Python dict (if all your examples fit in memory).
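
For example, a minimal sketch of such a lookup, built alongside the TaggedDocuments from the question (tag_to_words is just an ordinary dict you maintain yourself, not anything provided by gensim):

# Build a plain dict mapping each tag to the original tweet's words
tag_to_words = {}
for index, i in enumerate(cleaned_tweets):
    if len(i) > 2:
        tag_to_words[u'SENT_{:d}'.format(index)] = gensim.utils.to_unicode(i).split()

# After training, resolve the labels that most_similar() returns
for label, score in model.docvecs.most_similar('SENT_2'):
    print(label, ' '.join(tag_to_words[label]))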

Separately, in recent versions of gensim, your code will actually be doing 1,005 training passes over your taggeddocs, many of them with a nonsensical, destructively negative alpha value.

  • By passing the corpus (taggeddocs) into the constructor, you're telling the model to train itself, using your parameters and defaults, which include a default iter=5 passes.

  • You then do 200 more loops. Each call to train() does the default 5 passes. And since alpha has been decremented from 0.025 by 0.002 a total of 199 times by the final loop, that last loop uses an effective alpha of 0.025-(199*0.002)=-0.373, a negative value that essentially tells the model to make a large correction in the opposite direction of improvement for each training example (the short sketch after this list reproduces that arithmetic).
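
A quick check of those numbers, assuming the loop in the question runs exactly as written:

# 5 passes from the constructor, plus 200 loops of 5 passes per train() call
print(5 + 200 * 5)                    # 1005 total passes

# alpha has already been decremented 199 times when the final loop trains
print(round(0.025 - 199 * 0.002, 3))  # -0.373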

Just use the iter parameter to choose the desired number of passes, and let the class manage the alpha decay itself. If you supply the corpus when instantiating the model, no further steps are necessary. If you don't supply the corpus at instantiation, you'll need to call model.build_vocab(taggeddocs) once, then model.train(taggeddocs) once.
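
A minimal sketch of that recommended flow, keeping the question's other parameters (this uses the older size/iter parameter names from the question's gensim version; newer releases rename them vector_size/epochs and require explicit counts in train()):

# Option 1: supply the corpus at construction; the model trains itself once, properly
model = gensim.models.Doc2Vec(taggeddocs, dm=0, size=20, min_count=0, iter=200)

# Option 2: instantiate without a corpus, then build the vocab and train exactly once
model = gensim.models.Doc2Vec(dm=0, size=20, min_count=0, iter=200)
model.build_vocab(taggeddocs)
model.train(taggeddocs)  # newer gensim versions require explicit counts, e.g.
                         # model.train(taggeddocs, total_examples=model.corpus_count, epochs=200)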