I implemented a model that uses Logistic Regression as the classifier, and I wanted to plot the learning curves for both the training and test sets to decide what to do next to improve the model.
For context: to plot the learning curves I defined a function that takes a model, a pre-split dataset (train/test X and y arrays, split with the train_test_split function), and a scoring function as input, then trains the model on n exponentially spaced subsets of the training data and returns the learning curves.
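Roughly, the function looks like this (a simplified sketch, not my exact code; the function name, the subset count, and the geomspace spacing are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def learning_curves(model, X_train, X_test, y_train, y_test,
                    scorer=accuracy_score, n_subsets=10):
    """Train `model` on exponentially spaced subsets of the training
    data, scoring each fit on the subset itself and on the test set."""
    # Exponentially spaced subset sizes, from a small sample up to the full set
    sizes = np.unique(np.geomspace(10, len(X_train), n_subsets).astype(int))
    train_scores, test_scores = [], []
    for m in sizes:
        model.fit(X_train[:m], y_train[:m])
        train_scores.append(scorer(y_train[:m], model.predict(X_train[:m])))
        test_scores.append(scorer(y_test, model.predict(X_test)))
    return sizes, train_scores, test_scores

# Hypothetical usage with a pre-split dataset:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# sizes, tr, te = learning_curves(LogisticRegression(max_iter=1000),
#                                 X_train, X_test, y_train, y_test)
```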
My results are shown in the image below.
Why does the training accuracy start so high, then suddenly drop, and then start to rise again as the training set size increases, while the test accuracy does the opposite? My guess was that the extremely good accuracy at the start, and the subsequent drop, were noise from the very small subsets, and that the curves only stabilised once the subsets became large enough to be representative, but I am not sure. Can someone explain this?
And finally, can we read these results as indicating low variance and moderate bias (70% accuracy is not that bad in my context), meaning that to improve my model I should resort to ensemble methods or heavy feature engineering?