
I am a bit confused regarding an aspect of Doc2Vec. Basically, I am not sure if what I do makes sense. I have the following dataset :

train_doc_0      --> label_0
    ...               ...
train_doc_99     --> label_0
train_doc_100    --> label_1
    ...               ...
train_doc_199    --> label_1
    ...               ...
    ...               ...
train_doc_239999 --> label_2399

eval_doc_0
    ...
eval_doc_29

where train_doc_n is a short document belonging to some label. There are 2400 labels, with 100 training documents per label. The eval_doc_n are evaluation documents whose labels I would ultimately like to predict (using a classifier).

I train a Doc2Vec model with these training documents & labels. Once the model is trained, I re-project each of the original training documents, as well as my evaluation documents (the ones I would like to classify in the end), into the model's space using infer_vector.

The result is the following set of matrices:

X_train (240000,300) # doc2vec vectors for training documents
y_train (240000,)    # corresponding labels
X_eval  (30, 300)    # doc2vec vectors for evaluation documents
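
In code, my pipeline looks roughly like this (a minimal sketch with placeholder texts; it uses the older gensim API, where the parameters are called size and iter rather than vector_size and epochs):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Placeholder data standing in for my real corpus.
    train_texts  = [["some", "tokens"] for _ in range(240000)]
    train_labels = ["label_%d" % (i // 100) for i in range(240000)]
    eval_texts   = [["other", "tokens"] for _ in range(30)]

    # One TaggedDocument per training text, tagged with a unique ID.
    tagged = [TaggedDocument(words=words, tags=[str(i)])
              for i, words in enumerate(train_texts)]

    model = Doc2Vec(tagged, size=300, iter=20, min_count=2, workers=4)

    # Re-project training and evaluation docs into the model's space.
    X_train = [model.infer_vector(words) for words in train_texts]
    y_train = train_labels
    X_eval  = [model.infer_vector(words) for words in eval_texts]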

My problem is the following: if I run a simple cross-validation on X_train and y_train, I get decent accuracy. But once I try to classify my evaluation documents (even restricting to only 50 randomly sampled labels), I get very poor accuracy, which makes me question my approach to this problem.

I followed this tutorial for training the documents.

Does my approach make sense, especially re-projecting all the training documents using infer_vector?


1 Answer


I don't see anything blatantly wrong.

Are the evaluation documents similar to the training documents in length, vocabulary, etc.? Ideally, they'd be a randomly chosen subset of all available labeled examples. (If they're quite different, that could explain why cross-validation accuracy and held-out-evaluation accuracy diverge.)
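
For instance, such a random held-out subset can be carved out like this (a sketch; all_texts and all_labels stand in for your full labeled data):

    from sklearn.model_selection import train_test_split

    # Placeholders for the full labeled corpus.
    all_texts  = [["some", "tokens"] for _ in range(1000)]
    all_labels = [i // 100 for i in range(1000)]

    # Hold out a random evaluation set drawn from the same distribution.
    train_texts, eval_texts, y_train, y_eval = train_test_split(
        all_texts, all_labels, test_size=30, random_state=0)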

When training the Doc2Vec model, are you giving each document a single unique ID as the only entry of its tags? Or are you using the label_n labels as the tags of your training examples? Or perhaps both? (Any of those is a defensible choice, though I've found that mixing the known labels into the usually 'unsupervised' Doc2Vec training, making it semi-supervised, often helps the models' vectors become more useful as input to later explicitly-supervised classifiers.)
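
For illustration, the tagging choices look like this (a sketch; tokens and the tag strings are placeholders):

    from gensim.models.doc2vec import TaggedDocument

    tokens = ["some", "example", "tokens"]  # placeholder tokenized document

    # Option 1: a single unique ID as the only tag.
    doc_plain = TaggedDocument(words=tokens, tags=["doc_12345"])

    # Option 2 (semi-supervised): add the known label as a second tag, so
    # all 100 documents sharing a label also train one shared label-vector.
    doc_mixed = TaggedDocument(words=tokens, tags=["doc_12345", "label_123"])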

When I see surprisingly 'super-bad' accuracy at an unexpected step, it's often because some erroneous shuffling or re-ordering of the test examples has occurred, randomizing the real relationships. So it's worth double-checking for that, both in code and by looking at a few examples in detail.

Re-inferring vectors for examples used in training, rather than simply asking for the trained-up vectors retained in the model, sometimes results in better vectors. That said, many have observed that non-default parameters to infer_vector(), especially many more steps and perhaps a starting alpha closer to that used during training, may improve results. (Also, inference seems to need fewer steps in the simpler PV-DBOW mode, dm=0; PV-DM, dm=1, may especially require more steps.)
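
For example (a sketch assuming an older gensim where infer_vector() accepts steps and alpha, and a hypothetical saved model file):

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("my_doc2vec.model")  # hypothetical saved model
    tokens = ["some", "example", "tokens"]

    # Older defaults were about steps=5, alpha=0.1; many more steps and a
    # training-like starting alpha often give more stable inferred vectors.
    vec = model.infer_vector(tokens, steps=50, alpha=0.025)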

The tutorial you link shows a practice, calling train() multiple times while adjusting alpha yourself, that's generally unnecessary and error-prone, and specifically isn't likely to be doing the right thing in the latest gensim versions. You can leave the default alpha/min_alpha in place and supply a preferred iter value during Doc2Vec initialization; then one call to train() will automatically do that many passes and glide the learning-rate down properly. And since the default iter is 5, if you don't set it, every call to train() does 5 passes, so your own external loop of 10 would mean 50 passes, and the code at that tutorial, with two calls to train() per loop for some odd reason, would mean 100 passes.
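
That is, something like the following (a sketch in the same older-gensim style, with tagged_docs standing in for your corpus):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Placeholder corpus.
    tagged_docs = [TaggedDocument(["some", "tokens"], [str(i)])
                   for i in range(1000)]

    # Set iter once at initialization; leave alpha/min_alpha at defaults.
    model = Doc2Vec(size=300, iter=20, min_count=2, workers=4)
    model.build_vocab(tagged_docs)

    # One train() call then does all 20 passes, decaying alpha properly.
    model.train(tagged_docs,
                total_examples=model.corpus_count, epochs=model.iter)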