I am a bit confused regarding an aspect of Doc2Vec. Basically, I am not sure if what I do makes sense. I have the following dataset :
train_doc_0 --> label_0
... ...
train_doc_99 --> label_0
train_doc_100 --> label_1
... ...
train_doc_199 --> label_1
... ...
... ...
train_doc_239999 --> label_2399
eval_doc_0
...
eval_doc_29
Where train_doc_n is a short document, belonging to some label. There are 2400 labels, with 100 training documents per label. eval_doc_0 are evaluation documents where I would like to predict their label in the end (using a classifier).
I train a Doc2Vec model with these training documents & labels. Once the model is trained, I reproject each of the original training document as well as my evaluation documents (the ones I would like to classify in the end) into the model's space using infer_vector.
The resulting is a matrix :
X_train (240000,300) # doc2vec vectors for training documents
y_train (240000,) # corresponding labels
y_eval (30, 300) # doc2vec vectors for evaluation documents
My problem is the following : If I run a simple cross validation on X_train and y_train, I have a decent accuracy. Once I try to classify my evaluation documents (even, using only 50 randomly sampled labels) I have a super bad accuracy, which makes me question my way of approaching this problem.
I followed this tutorial for the training of documents.
Does my approach make sense, especially with reprojecting all the training documents using infer_vector ?