
I am doing text classification using gensim's Doc2Vec. I am testing this with two datasets, one from Stack Exchange and one from Reddit. I am trying to classify posts from one subreddit/Stack Exchange site on a particular subject as positive examples, using posts from other, unrelated subreddits/sites as negative examples.

I train the model on a dataset of 10k posts and test on a set of 5k, split 50/50 between positive and negative examples. I then use the infer_vector and most_similar functions to classify each entry as positive or negative. Before training the model I pre-process the data to remove insignificant words, symbols, links etc., leaving only the most significant words. Below is the code used to train the model.

import pandas as pd
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

df = pd.read_csv("fulltrainingset.csv")

df.columns.values[0] = "A"

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(df["A"])]

epoch_list = [1,5,10,15,25,50,100,200,300,400]
size_list = [1,5,10,15,25,50,100,200,300]

for x in epoch_list:
    for y in size_list:

        vec_size = y
        max_epochs = x
        minimum_count = 1
        mode = 0
        window_ = 15
        negative_sampling = 5
        subsampling = 1e-5
        alpha = 0.025
        minalpha = 0.00025

        model = Doc2Vec(alpha=alpha, min_alpha=minalpha, vector_size=vec_size, dm=mode, min_count=minimum_count, window=window_, sample=subsampling, hs=negative_sampling)
        model.build_vocab(tagged_data)

        for epoch in range(max_epochs):
            print('iteration {0}'.format(epoch))
            model.train(tagged_data,
                        total_examples=model.corpus_count,
                    epochs=model.epochs)
            model.alpha -= 0.0002
            model.min_alpha = model.alpha


        model.save(str(y)+"s_"+str(x)+"e.model")
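The pre-processing step mentioned above isn't shown in the code; a minimal sketch of that kind of cleanup, using only the standard library (the exact rules here are illustrative assumptions, not the actual pipeline), might look like:

```python
import re

def preprocess(text):
    """Lowercase, strip URLs and non-alphanumeric symbols, drop short tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)          # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # remove symbols/punctuation
    tokens = [t for t in text.split() if len(t) > 2]   # keep only longer words
    return tokens

# e.g. preprocess("Check THIS out: https://example.com!!") -> ['check', 'this', 'out']
```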

This method works and I can get results from it, but I would like to know if there is a better way to train the model. Currently I just train many models with different epochs and vector_size values, then use the infer_vector and most_similar functions to check whether the similarity score of the top most_similar entry exceeds a certain threshold. Is there a way to improve on this in terms of how the model is trained?
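For reference, the decision rule described here (infer a vector, then check whether the best similarity clears a cutoff) can be sketched in plain Python with an explicit cosine similarity; the 0.6 threshold is an arbitrary placeholder:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(inferred_vec, training_vecs, threshold=0.6):
    """Positive if the best cosine similarity against the training
    vectors exceeds the threshold, mirroring most_similar()'s top hit."""
    best = max(cosine(inferred_vec, v) for v in training_vecs)
    return best > threshold
```

This is essentially a one-nearest-neighbour decision with a cutoff, which is part of what makes it fragile compared with a trained classifier.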

Also, aiming for better results, I trained another model the same way on a larger dataset (100k+ entries). When I used this model on the same test set it produced similar but slightly worse results than the models trained on smaller datasets. I thought more training data would improve the results, not make them worse; does anyone know a reason for this?

Also, to test further, I created a new, bigger test set (15k entries), which did even worse than the original test set. The data in this test set, although unique, is the same type of data used in the original test set, yet it produces worse results. What might be the reason for this?

df = pd.read_csv("all_sec_tweets.csv")

df.columns.values[0] = "A"

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(df["A"])]

epoch_list = [1,5,10,15,25,50,100]
size_list = [1,5,10,15,25,50,100]

for x in epoch_list:
    for y in size_list:

        vec_size = y
        max_epochs = x
        mode = 0
        window_ = 5
        subsampling = 1e-5

        model = Doc2Vec(vector_size=vec_size, dm=mode, window=window_, sample=subsampling, epochs=max_epochs)
        model.build_vocab(tagged_data)

        model.train(tagged_data,total_examples=model.corpus_count,epochs=model.epochs)

        model.save(str(y)+"s_"+str(x)+"e.model")

1 Answer


It sounds as if you're training a separate Doc2Vec model for each forum's "in"/"out" decision, then using an improvised set of infer_vector()/most_similar() operations to make a decision.

That's a very rough, ad-hoc approach, and you should look into intros to more formal text-classification approaches, where there is a clear step of feature-discovery (which might include creating Doc2Vec vectors for your texts, or other techniques), then a clear step of classifier-training, then evaluation.
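As an illustration of that separation of steps, here is a toy fit/predict/evaluate pipeline over precomputed document vectors, using a simple nearest-centroid classifier (in practice you would feed Doc2Vec vectors into something like scikit-learn's LogisticRegression; all names here are illustrative):

```python
def fit_centroids(vectors, labels):
    """Feature step done elsewhere; here, average each class's vectors
    into one centroid per label."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def predict(centroids, vec):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

def accuracy(centroids, vectors, labels):
    """Evaluation step: fraction of held-out vectors labeled correctly."""
    hits = sum(predict(centroids, v) == l for v, l in zip(vectors, labels))
    return hits / len(labels)
```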

(You might also at that point then be training larger models which include labeled training examples from all forums, and classifiers which pick one-of-many possible classes.)

Separately, several things are wrong or non-optimal in your Doc2Vec training, including:

  • It's almost always misguided to call train() more than once in your own loop, or to change the default alpha/min_alpha values. Your current code is in fact making model.epochs (5) passes over the data on every call, and often decrementing alpha by 0.0002 hundreds of times (into nonsensical negative values). Call train() just once, with the desired number of epochs and the default alpha/min_alpha values, and it will do the right thing. (And: don't trust whatever online tutorial/example suggested the looping calls above.)

  • Your hs=5 turns the strictly on/off hierarchical-softmax mode on, but leaves the default negative=5 parameter in place, so your model is using a (non-standard, probably unhelpful and slow) combination of both negative-sampling and hierarchical-softmax training. It's better to use either some non-zero negative value with hs=0 (for pure negative-sampling), or negative=0 with hs=1 (for pure hierarchical-softmax). Or just stick with the defaults (negative=5, hs=0) unless/until everything is already working and you want to descend into deeper optimizations.

  • min_count=1 is rarely the best option: these models often benefit from discarding rare words.

After correcting these issues, you may find that more data then brings the usual expected improvement. (And if it doesn't, double-check that all text preprocessing/tokenization is done the same way at training, inference, and evaluation time; if you're still having problems, perhaps post a new question with more specifics/numbers about where the expected improvements instead scored worse.)