I am new to machine learning and am currently using scikit-learn's MLPClassifier for a neural network task. Following Andrew Ng's well-known machine learning course, I am plotting the learning curve, in my case with scikit-learn's learning_curve function (see also the documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html):
import numpy as np
from sklearn.model_selection import GroupKFold, learning_curve
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='adam', activation='relu', alpha=0.001,
                    learning_rate='constant', learning_rate_init=0.0001,
                    hidden_layer_sizes=[39, 37, 31, 34], batch_size=200,
                    max_iter=1000, verbose=True)
cv = GroupKFold(n_splits=8)
estimator = clf
ylim = (0.7, 1.01)  # y-axis limits for the plot
n_jobs = 1
train_sizes = np.linspace(.01, 1.0, 100)

# Calculate the learning curve
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X_array_train, Y_array_train,
    groups=groups_array_train, cv=cv, n_jobs=n_jobs,
    train_sizes=train_sizes, scoring='accuracy', verbose=10)
My solver for the MLPClassifier is 'adam' and the batch size is 200.
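To make the batch boundaries concrete, here is a minimal sketch of how a training set might be cut into mini-batches of at most 200 examples. It assumes consecutive slicing with a smaller final batch, which is my assumption about what MLPClassifier does internally; `batch_lengths` is a hypothetical helper, not a scikit-learn function.

```python
def batch_lengths(n_samples, batch_size=200):
    """Return the length of each mini-batch in one epoch (hypothetical helper)."""
    # Assumption: batches are consecutive slices, and the last one
    # simply holds whatever examples are left over.
    batch_size = min(batch_size, n_samples)
    return [min(batch_size, n_samples - start)
            for start in range(0, n_samples, batch_size)]


for n in (150, 200, 201, 400, 405):
    print(n, batch_lengths(n))
```

Under this assumption, a training set just above a multiple of 200 would end every epoch with a very small batch.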
This is the resulting plot: https://i.imgur.com/jDNoEVg.png
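A minimal sketch of the aggregation step behind such a plot, assuming it follows the linked scikit-learn example (mean and standard deviation over the folds for each training-set size). The score arrays here are random stand-ins with the shape learning_curve returns, not my real results, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sizes, n_folds = 100, 8

# Stand-in scores (random, NOT real results) with the shape that
# learning_curve returns: one row per training-set size, one column per fold.
train_scores = rng.uniform(0.9, 1.0, size=(n_sizes, n_folds))
test_scores = rng.uniform(0.7, 0.9, size=(n_sizes, n_folds))

# Average over the folds (axis=1) to get one point per training-set size.
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)

# Gap between training and cross-validation score at the largest size.
final_gap = train_mean[-1] - test_mean[-1]
print(train_mean.shape, test_mean.shape, final_gap)
```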
I have three questions concerning the interpretation of such learning curves:
1.) As I understand it, this learning curve gives the training and cross-validation scores for different amounts of training data at the end of one epoch (epoch = one forward pass and one backward pass over all training examples). By looking at the gap between the two curves and at the scores they end up at, I can diagnose whether I have a high-bias or a high-variance problem. However, according to the verbose output of my MLPClassifier, the neural network trains over several epochs, so which epoch does the curve show: the first epoch of training, the last epoch, or the average score over all epochs? Or am I misunderstanding epochs altogether?
2.) I get spikes whenever a new batch starts (after 200 and 400 training examples). What is the correct way to interpret them?
3.) Understanding 1.) will probably answer this as well: what makes this function so slow that several parallel jobs (n_jobs) are needed to get it done in a reasonable time? clf.fit(X,y) is fast in my case.
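For context on question 3, here is a rough count of the work this call requests. It assumes learning_curve trains a fresh copy of the estimator for every combination of training-set size and fold, which is my reading of the documentation; duplicate sizes may be dropped, so treat the numbers as an upper bound.

```python
n_train_sizes = 100   # len(np.linspace(.01, 1.0, 100))
n_splits = 8          # GroupKFold(n_splits=8)
max_iter = 1000       # upper limit on epochs per fit for solver='adam'

# Assumption: one independent fit per (training-set size, fold) pair.
max_fits = n_train_sizes * n_splits
# Each fit may stop early, so this is only a worst-case epoch count.
max_epochs_total = max_fits * max_iter

print(max_fits, max_epochs_total)
```

If that assumption holds, the call performs up to 800 separate fits, compared with the single fit of clf.fit(X,y).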
I would be really grateful if someone could help me understand this better. I am also open to literature recommendations.
Many thanks in advance!