
I am trying to find the best hyperparameters for my trained doc2vec gensim model, which takes a document as input and creates its document embedding. My training data consists of text documents, but it doesn't have any labels, i.e. I just have 'X' but not 'y'.

I found some questions here related to what I am trying to do, but all of the proposed solutions are for supervised models, none for unsupervised ones like mine.

Here is the code where I am training my doc2vec model:

def train_doc2vec(
    self,
    X: List[List[str]],
    epochs: int = 10,
    learning_rate: float = 0.0002) -> gensim.models.doc2vec.Doc2Vec:

    tagged_documents = list()

    for idx, w in enumerate(X):
        td = TaggedDocument(to_unicode(str.encode(' '.join(w))).split(), [str(idx)])
        tagged_documents.append(td)

    model = Doc2Vec(**self.params_doc2vec)
    model.build_vocab(tagged_documents)

    for epoch in range(epochs):
        model.train(tagged_documents,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= learning_rate
        # fix the learning rate, no decay
        model.min_alpha = model.alpha

    return model

I need suggestions on how to proceed and find best hyperparameters for my trained model using GridSearch or any suggestions about some other technique. Help is much appreciated.

Your loop calling train() multiple times is very broken, and will only get more broken once you start trying different combinations of epochs, alpha, and learning_rate. Where did you copy this logic from? - gojomo
Got it from my friend's GitHub repository. This model gives me 75% train accuracy. What else do you suggest? How can I make this less broken, and how can I tune the parameters? - Rajat
@gojomo I tried to remove the for loop and train the model without it, but I got much worse accuracy (55%); with that loop (running 10 times) I am getting 75%. - Rajat
Then your friend's GitHub repo has a serious flaw & shouldn't be used as a model. Can you ask them where they got it? Call train() only once, with your desired number of epochs. The current code is a mess that (among other things) is actually doing 10*10 training passes and sends the learning-rate all over the place (down and up again) during training. If it's helping, it's pure dumb luck – and something like (possibly) just using 100 epochs in non-broken code would do better. - gojomo
But if more training (epochs) is hurting, that strongly suggests your model is 'overfitting' – the model is too large for your data, and is essentially 'memorizing' the idiosyncrasies of your data to meet its training goals, thus becoming less useful/general for other tasks. Get more data, or shrink the model – for example by using a smaller vector_size dimensionality, and/or a higher word min_count to discard more rare words. - gojomo

1 Answer


Independently of the correctness of the code, I will try to answer your question on how to tune hyper-parameters. You start by defining the set of hyper-parameter combinations that make up your grid search. For each set of hyper-parameters

Hset1 = (par1Value1, par2Value1, ..., parNValue1)

you train your model on the training set and use an independent validation set to measure your accuracy (or whatever metric you wish to use). You store this value (e.g. A_Hset1). When you have done this for all possible sets of hyper-parameters, you will have a set of measures

(A_Hset1, A_Hset2, A_Hset3, ..., A_HsetK).

Each one of those measures tells you how good your model is for a given set of hyper-parameters, so your optimal set of hyper-parameters is

HsetOptimal = HsetX | A_HsetX = max(A_Hset1, A_Hset2, A_Hset3, ..., A_HsetK)

In order to have a fair comparison, you should always train the model on the same training data and always use the same validation set.

I'm not an advanced Python user, so you can probably find better suggestions around, but what I would do is create a list of dictionaries, where each dictionary contains one set of hyper-parameters that you want to test:

grid_search = [{"par1": "val1", "par2": "val1", "par3": "val1", ..., "res": ""},
               {"par1": "val2", "par2": "val1", "par3": "val1", ..., "res": ""},
               {"par1": "val3", "par2": "val1", "par3": "val1", ..., "res": ""},
               ...,
               {"par1": "valn", "par2": "valn", "par3": "valn", ..., "res": ""}]

That way you can store your result in the "res" field of the corresponding dictionary and track the performance for each set of parameters.

for params in grid_search:
    # insert here your training and accuracy evaluation using the
    # hyper-parameters in params

    params["res"] = the_accuracy_for_these_hyperparameters

I hope it helps.