I am doing text classification using gensim and doc2vec. I am testing this with two data-sets: a Stack Exchange data-set and a Reddit data-set. I am trying to classify posts from one subreddit/Stack Exchange site on a particular subject as positive examples, using posts from other, unrelated subreddits/Stack Exchange sites as negative examples.
I am using a training set of 10k posts to train the model and a test set of 5k posts, split into 50% positive and 50% negative examples. I then use the infer_vector and most_similar functions to classify each entry as positive or negative. Before training the model I pre-process the data to remove insignificant words, symbols, links etc., leaving just the most significant words. A simplified sketch of that pre-processing, followed by the code used to train the models, is below.
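The pre-processing is roughly along these lines (a simplified sketch only; the regexes and the NLTK stop-word list here are placeholders, not my exact code):

import re
import pandas as pd
from nltk.corpus import stopwords   # requires nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_text(text):
    # drop links, keep only letters, lower-case, remove stop words and very short tokens
    text = re.sub(r"http\S+|www\.\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [w for w in text.split() if w not in stop_words and len(w) > 2]
    return " ".join(tokens)

df = pd.read_csv("fulltrainingset.csv")
df.columns.values[0] = "A"
df["A"] = df["A"].astype(str).apply(clean_text)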
df = pd.read_csv("fulltrainingset.csv")
df.columns.values[0] = "A"
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(df["A"])]
epoch_list = [1,5,10,15,25,50,100,200,300,400]
size_list = [1,5,10,15,25,50,100,200,300]
for x in epoch_list:
for y in size_list:
vec_size = y
max_epochs = x
minimum_count = 1
mode = 0
window_ = 15
negative_sampling = 5
subsampling = 1e-5
alpha = 0.025
minalpha = 0.00025
model = Doc2Vec(alpha=alpha, min_alpha=minalpha, vector_size=vec_size, dm=mode, min_count=minimum_count, window =window_, sample=subsampling ,hs =negative_sampling)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
print('iteration {0}'.format(epoch))
model.train(tagged_data,
total_examples=model.corpus_count,
epochs=model.epochs)#self.epochs
model.alpha -= 0.0002
model.min_alpha = model.alpha
model.save(str(y)+"s_"+str(x)+"e.model")
This method works and I can get results from it, but I would like to know if there is a different way of training that achieves better results. Currently I just train many models with different epoch counts and vector sizes, then use infer_vector and most_similar to check whether the similarity score of the closest entry returned by most_similar is greater than a certain threshold. Is there a way to improve on this on the training side?
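For context, the classification step is essentially this (simplified sketch; the model file name and the 0.6 threshold are placeholders):

from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

model = Doc2Vec.load("100s_50e.model")   # one of the saved models
threshold = 0.6                          # placeholder cut-off

def classify(post):
    tokens = word_tokenize(post.lower())
    vec = model.infer_vector(tokens)
    # cosine similarity of the closest training document to the inferred vector
    top_tag, top_sim = model.docvecs.most_similar([vec], topn=1)[0]   # model.dv in gensim 4+
    return top_sim > threshold   # True -> positive, False -> negative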
Also, aiming for better results, I trained another model in the same way on a larger data-set (100k+ entries). When I used this model on the same test-set it produced similar but slightly worse results than the models trained on the smaller data-set. I thought more training data would improve the results, not make them worse; does anyone know a reason for this?
Also, to test further I created a new, bigger test-set (15k entries), which did even worse than the original test-set. The data in this test-set, although unique, is the same type of data as in the original test-set, yet it produces worse results. What might be the reason for this? For reference, this is the training code for the second set of models:
df = pd.read_csv("all_sec_tweets.csv")
df.columns.values[0] = "A"
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(df["A"])]
epoch_list = [1,5,10,15,25,50,100]
size_list = [1,5,10,15,25,50,100]
for x in epoch_list:
for y in size_list:
vec_size = y
max_epochs = x
mode = 0
window_ = 5
subsampling = 1e-5
model = Doc2Vec(vector_size=vec_size, dm=mode, window =window_, sample=subsampling,epochs=max_epochs)
model.build_vocab(tagged_data)
model.train(tagged_data,total_examples=model.corpus_count,epochs=model.epochs)
model.save(str(y)+"s_"+str(x)+"e.model")