3
votes

I implemented a model that uses Logistic Regression as the classifier, and I wanted to plot the learning curves for both the training and test sets to decide what to do next to improve my model.

Just to give you some information: to plot the learning curve I defined a function that takes a model, a pre-split dataset (train/test X and Y arrays, NB: split using the train_test_split function), and a scoring function as input. It iterates through the dataset, training on n exponentially spaced subsets, and returns the learning curves.
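A minimal sketch of what such a function could look like, assuming scikit-learn and NumPy. The name `learning_curve_points`, its parameters, and the synthetic data are hypothetical stand-ins, not the original code:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def learning_curve_points(model, X_train, y_train, X_test, y_test, scorer,
                          n_points=8, min_size=20):
    """Hypothetical helper: score `model` on exponentially spaced training subsets."""
    # Exponentially (geometrically) spaced subset sizes, duplicates removed.
    sizes = np.unique(np.geomspace(min_size, len(X_train), n_points).astype(int))
    used_sizes, train_scores, test_scores = [], [], []
    for m in sizes:
        X_sub, y_sub = X_train[:m], y_train[:m]
        # A classifier cannot be fitted on a subset containing a single class.
        if len(np.unique(y_sub)) < 2:
            continue
        est = clone(model).fit(X_sub, y_sub)  # fresh, unfitted copy each time
        used_sizes.append(int(m))
        train_scores.append(scorer(y_sub, est.predict(X_sub)))
        test_scores.append(scorer(y_test, est.predict(X_test)))
    return np.array(used_sizes), np.array(train_scores), np.array(test_scores)


# Synthetic, noisy data standing in for the real dataset (an assumption).
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sizes, train_acc, test_acc = learning_curve_points(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test, y_test, accuracy_score)
for m, tr, te in zip(sizes, train_acc, test_acc):
    print(f"{m:5d} samples  train={tr:.3f}  test={te:.3f}")
```

Note that the training score is computed on the subset actually used for fitting, while the test score always uses the full held-out set.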

My results are shown in the image below:

[image: learning curves for the training and test sets]

Why does the training accuracy start so high, then suddenly drop, then start to rise again as the training set size increases? And why does the test accuracy do the reverse? I thought the extremely good accuracy and the subsequent fall were caused by noise from the small datasets at the beginning, and that accuracy started to rise once the datasets became more consistent, but I am not sure. Can someone explain this?

And finally, can we assume that these results mean low variance and moderate bias (70% accuracy is not that bad in my context), so that to improve my model I must resort to ensemble methods or extensive feature engineering?


2 Answers

4
votes

I think what happens is you are overfitting the training samples when the dataset is small (very high training accuracy, low test accuracy). As you grow the dataset size, your classifier starts to generalize better, thus raising the success rate in the test dataset.

Beyond roughly 10^3 samples, the accuracy seems to level off at about 70%, which suggests you have reached a good balance between overfitting and underfitting.
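One way to check whether the behaviour at small sizes is mostly noise is to average the curve over several cross-validation splits. A minimal sketch, assuming scikit-learn's `learning_curve` and synthetic data in place of the real dataset (the data and the chosen sizes are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic, noisy data standing in for the real dataset (an assumption).
X_cv, y_cv = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)

# With cv=5, each training fold holds 1600 samples, so sizes must not exceed that.
requested_sizes = np.unique(np.geomspace(30, 1600, 8).astype(int))

sizes_abs, train_scores_cv, val_scores_cv = learning_curve(
    LogisticRegression(max_iter=1000), X_cv, y_cv,
    train_sizes=requested_sizes, cv=5, scoring="accuracy",
    shuffle=True, random_state=0)

# Mean and spread across the 5 folds for each training-set size.
for m, tr, va, sd in zip(sizes_abs, train_scores_cv.mean(axis=1),
                         val_scores_cv.mean(axis=1), val_scores_cv.std(axis=1)):
    print(f"{m:5d} samples  train={tr:.3f}  validation={va:.3f} (+/- {sd:.3f})")
```

A large spread across folds at the smallest sizes would suggest that the early part of a single-split curve should not be over-interpreted.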

0
votes

As far as my understanding goes, your learning curves indicate a high variance scenario. Accuracy for the training set typically starts high as complex models can usually fit a small number of samples well. As the sample count increases even complex models can't separate the classes perfectly so accuracy starts to go down.

You called the held-out dataset "test", but it's usually called the validation set. The fact that the training and validation curves converge and then plateau as the sample count increases indicates that the best performance for that model configuration has been reached. Getting more sample data won't help. If you want to improve accuracy, you'd need to find a way to reduce bias, which usually means tuning your model's parameters or using a different learning algorithm.
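As an illustration of parameter tuning, here is a minimal sketch that searches over the regularization strength `C` of a logistic regression with scikit-learn's `GridSearchCV`. The synthetic data and the grid values are assumptions chosen only for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, noisy data standing in for the real dataset (an assumption).
X_gs, y_gs = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)

# Smaller C means stronger regularization; larger C lets the model fit more freely.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_gs, y_gs)

print("best C:", search.best_params_["C"])
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

If the best score barely moves across the grid, the remaining error is more likely due to the model family or the features than to this particular setting.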