2 votes

I'm using a CNN for short text classification (classifying product titles). The code is from http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

The accuracy on the training set, test set, and validation set is shown below: [plot of training, test, and validation accuracy]

The loss is also different: the validation loss is double the loss on the training and test sets. (I can't upload more than 2 pictures, sorry!)

The training set and test set come from the web via a crawler and are split 7:3. The validation set comes from real app messages and was tagged by manual labeling.

I have tried almost every hyperparameter:

up-sampling, down-sampling, and no sampling

batch sizes of 1024, 2048, 5096

dropout of 0.3, 0.5, 0.7

embedding_size of 30, 50, 75

But none of these worked!

Now I use the parameters below (a rough sketch of the corresponding configuration follows the list):

batch_size is 2048

embedding_size is 30

sentence_length is 15

filter_sizes are 3, 4, 5

dropout_prob is 0.5

l2_lambda is 0.005
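Concretely, here is roughly how these settings map onto the tutorial's TextCNN class (num_classes, vocab_size, and num_filters below are placeholders, not my exact values):

```python
# Rough configuration sketch based on the tutorial's TextCNN class.
# num_classes, vocab_size and num_filters are placeholder values.
cnn = TextCNN(
    sequence_length=15,        # sentence_length
    num_classes=num_classes,   # number of product-title categories
    vocab_size=vocab_size,
    embedding_size=30,
    filter_sizes=[3, 4, 5],
    num_filters=128,           # tutorial default, assumed here
    l2_reg_lambda=0.005)       # l2_lambda

# Dropout is applied at train time through the feed dict, e.g.:
# feed_dict = {cnn.input_x: x_batch,
#              cnn.input_y: y_batch,
#              cnn.dropout_keep_prob: 0.5}
```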

At first I thought it was overfitting, but the model performs better on the test set than on the training set, so I'm confused!

Is the distribution of the test set very different from the training set?

How can I improve the performance on the validation set?

Are you sure you have the traces in that plot labeled correctly? It seems weird that your test accuracy is the highest. Almost definitely not right? – chris
@chris_anderson Thanks! I'm sure the traces in that plot are labeled correctly. I don't know why the validation accuracy is so low. – Nan.Zhang
Are you able to reproduce the accuracy in the original tutorial? What is the expected validation accuracy for this model? – Yao Zhang
@YaoZhang Thanks! The expected validation accuracy is at least 92%, like the dev accuracy in the graph. After training for several hours, the validation loss is 6 times larger than the training loss. Is something wrong with my validation set? – Nan.Zhang

2 Answers

0 votes

I think this difference in loss comes from the fact that the validation dataset was collected from a different domain than the training/test sets:

The training set and test set come from the web via a crawler and are split 7:3. The validation set comes from real app messages and was tagged by manual labeling.

The model did not see any real app message data during training, so it unsurprisingly fails to deliver good results on the validation set. Traditionally, all three sets are generated from the same pool of data (say, with a 7-1-2 split). The validation set is used for hyperparameter tuning (batch_size, embedding_size, etc.), while the test set is held out for an objective measure of model performance.

If you are ultimately concerned with performance on the app data, I would split that dataset 7-1-2 (train-validation-test) and augment the training data with the web crawler data.
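A rough sketch of that split, assuming scikit-learn is available and using hypothetical app_texts/app_labels and web_texts/web_labels arrays for the two labeled datasets:

```python
from sklearn.model_selection import train_test_split

# Hypothetical arrays: app_texts/app_labels are the hand-labeled app
# messages, web_texts/web_labels are the crawled data.
# First carve out 20% of the app data as a held-out test set.
app_train_val_x, app_test_x, app_train_val_y, app_test_y = train_test_split(
    app_texts, app_labels, test_size=0.2, stratify=app_labels, random_state=42)

# Then take 1/8 of the remaining 80% as validation (0.8 * 1/8 = 0.1),
# giving a 7-1-2 train/validation/test split overall.
app_train_x, app_val_x, app_train_y, app_val_y = train_test_split(
    app_train_val_x, app_train_val_y, test_size=0.125,
    stratify=app_train_val_y, random_state=42)

# Augment only the training portion with the crawled data; validation
# and test stay app-only so they reflect the target distribution.
train_x = list(app_train_x) + list(web_texts)
train_y = list(app_train_y) + list(web_labels)
```

This way the validation accuracy you tune against actually tracks the app-message performance you care about.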

0 votes

I think the loss on the validation set is high because the validation data comes from real app messages, which may be more realistic than the training data you obtained from web crawling, which may contain noise. Your learning rate is very high and your batch size is much bigger than what's usually recommended. You can try learning rates in [0.1, 0.01, 0.001, 0.0001] and batch sizes in [32, 64]; the other hyperparameter values seem okay. A simple sweep could look like the sketch below.
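A minimal sweep sketch, assuming a hypothetical train_and_evaluate(learning_rate, batch_size) helper that trains your CNN with those settings and returns validation accuracy (this helper is not part of the tutorial code):

```python
# Hypothetical grid search over learning rate and batch size.
# train_and_evaluate() is an assumed helper that trains the model
# with the given settings and returns accuracy on the validation set.
best = None
for lr in [0.1, 0.01, 0.001, 0.0001]:
    for batch_size in [32, 64]:
        val_acc = train_and_evaluate(learning_rate=lr, batch_size=batch_size)
        print(f"lr={lr}, batch_size={batch_size}: val_acc={val_acc:.4f}")
        if best is None or val_acc > best[0]:
            best = (val_acc, lr, batch_size)
print("Best (val_acc, lr, batch_size):", best)
```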

I would also like to comment on the training, validation, and test sets. The training data is split into training and validation sets during development, while the test set is data we don't touch and use only to evaluate the final model. I think your "validation set" is really the test set and your "test set" is really the validation set; that's how I would refer to them.