I am preparing a Doc2Vec model using tweets. Each tweet's word array is considered as a separate document and is labeled as "SENT_1", SENT_2" etc.
taggeddocs = [] for index,i in enumerate(cleaned_tweets): if len(i) > 2: # Non empty tweets sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[u'SENT_{:d}'.format(index)]) taggeddocs.append(sentence) # build the model model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0) for epoch in range(200): if epoch % 20 == 0: print('Now training epoch %s' % epoch) model.train(taggeddocs) model.alpha -= 0.002 # decrease the learning rate model.min_alpha = model.alpha # fix the learning rate, no decay
I wish to find tweets similar to a given tweet, say "SENT_2". How?
I get labels for similar tweets as:
sims = model.docvecs.most_similar('SENT_2') for label, score in sims: print(label)
It prints as:
SENT_4372 SENT_1143 SENT_4024 SENT_4759 SENT_3497 SENT_5749 SENT_3189 SENT_1581 SENT_5127 SENT_3798
But given a label, how do I get original tweet words/sentence? E.g. what are the tweet words of, say, "SENT_3497". Can I query this to Doc2Vec model?