6
votes

I used gensim to fit a doc2vec model, with tagged documents (more than 10 of them) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.

An example of the training data (length > 10):

docs = ['This is a sentence', 'This is another sentence', ....]

With some preprocessing:

doc_=[d.strip().split(" ") for d in doc]
doc_tagged = []
for i in range(len(doc_)):
  tagd = TaggedDocument(doc_[i],str(i))
  doc_tagged.append(tagd)

The tagged docs look like this:

TaggedDocument(words=array(['a', 'b', 'c', ..., ],
  dtype='<U32'), tags='117')

Fit a doc2vec model:

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)

Then I get the final model:

len(model.docvecs)

The result is 10...

I tried other datasets (length > 100 and > 1000) and got the same result for len(model.docvecs). So, my question is: how do I use model.docvecs to get the full set of vectors (without using model.infer_vector)? Is model.docvecs designed to provide all of the training docvecs?


1 Answer

11
votes

The bug is in this line:

tagd = TaggedDocument(doc[i],str(i))

Gensim's TaggedDocument accepts a sequence of tags as a second argument. When you pass a string '123', it's turned into ['1', '2', '3'], because it's treated as a sequence. As a result, all of the documents are tagged with just 10 tags ['0', ..., '9'], in various combinations.
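To see why the count comes out as exactly 10, here is a minimal sketch in plain Python that does not use gensim at all. It only mimics how a sequence of tags gets iterated; `n_docs` is a made-up corpus size chosen for illustration, not something from your data.

```python
# Hypothetical corpus size; any value above 10 shows the same effect.
n_docs = 250

# Buggy tagging: a bare string is iterated character by character,
# so '117' contributes the tags '1' and '7' rather than '117'.
buggy_tags = set()
for i in range(n_docs):
    buggy_tags.update(str(i))

# Fixed tagging: a one-element list keeps the whole string as one tag.
fixed_tags = set()
for i in range(n_docs):
    fixed_tags.update([str(i)])

print(sorted(buggy_tags))  # the ten digit characters '0' through '9'
print(len(buggy_tags))     # 10
print(len(fixed_tags))     # 250
```

However many documents you add, the buggy version can never produce more than ten distinct tags, which matches what you see in len(model.docvecs).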

Another issue: you're defining doc_ and never actually using it, so your documents will be split incorrectly as well.

Here's the proper solution:

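from gensim.models import doc2vec
# Tokenize each document, then wrap each tag in a list so it stays a single tag
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]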