I'm using gensim 3.0.1.
I have a list of TaggedDocument objects, each with a unique label of the form "label_17". When I train a Doc2Vec model on it, the labels are somehow split into individual characters, so model.docvecs.doctags contains the following:
{'0': Doctag(offset=5, word_count=378, doc_count=40),
'1': Doctag(offset=6, word_count=1330, doc_count=141),
'2': Doctag(offset=7, word_count=413, doc_count=50),
'3': Doctag(offset=8, word_count=365, doc_count=41),
'4': Doctag(offset=9, word_count=395, doc_count=41),
'5': Doctag(offset=10, word_count=420, doc_count=41),
'6': Doctag(offset=11, word_count=408, doc_count=41),
'7': Doctag(offset=12, word_count=426, doc_count=41),
'8': Doctag(offset=13, word_count=385, doc_count=41),
'9': Doctag(offset=14, word_count=376, doc_count=40),
'_': Doctag(offset=4, word_count=2009, doc_count=209),
'a': Doctag(offset=1, word_count=2009, doc_count=209),
'b': Doctag(offset=2, word_count=2009, doc_count=209),
'e': Doctag(offset=3, word_count=2009, doc_count=209),
'l': Doctag(offset=0, word_count=4018, doc_count=418)}
However, in the initial list of tagged documents, each document has its own unique label.
The code for model training is the following:
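from gensim.models.doc2vec import Doc2Vec

# data is the list of TaggedDocument objects; total_words_count is computed beforehand
model = Doc2Vec(size=300, sample=1e-4, workers=2)
print('Building Vocabulary')
model.build_vocab(data)
print('Training...')
model.train(data, total_words=total_words_count, epochs=20)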
As a result, I can't look up a document with model.docvecs['label_17']; it raises a KeyError.
The same thing happens if I pass data to the constructor instead of calling build_vocab separately.
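One possible cause, sketched here under an assumption since the construction of data is not shown: if each document's tags were passed as a bare string such as "label_17" rather than a list such as ["label_17"], iterating over the tags would yield single characters. That would be consistent with the output above, where 'l' is counted twice per document. The namedtuple below is only a stand-in for gensim's TaggedDocument so the sketch runs without gensim, and count_tags is a hypothetical helper, not gensim's own code.

```python
from collections import Counter, namedtuple

# Stand-in for gensim's TaggedDocument (a namedtuple with `words` and `tags`),
# defined here so the sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def count_tags(docs):
    """Count tag occurrences across documents by iterating over each `tags` value."""
    counts = Counter()
    for doc in docs:
        for tag in doc.tags:  # a bare string iterates character by character
            counts[tag] += 1
    return counts


# Hypothetical data: tags given as a bare string (the suspected mistake).
as_string = [TaggedDocument(["some", "words"], "label_%d" % i) for i in range(3)]
# The same data with each tag wrapped in a list.
as_list = [TaggedDocument(["some", "words"], ["label_%d" % i]) for i in range(3)]

string_counts = count_tags(as_string)  # keys are single characters: 'l', 'a', 'b', ...
list_counts = count_tags(as_list)      # keys are the full labels: 'label_0', ...
```

If this is what is happening, wrapping each label in a list when building the TaggedDocument objects should make the full labels appear as keys.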
Why is this happening? Thanks.