1
votes

I used a classification decision tree in my analysis. First, I split the whole dataset into training and testing sets (60%:40%). Then I ran GridSearch on my training set to find the best-scoring model (max_depth=7). Then I plotted the learning curve over the cross-validation and training sets. Here is the graph I got. The two lines seem to overlap. So what does that tell me? That there is no overfitting in my model? And in general, why do we need the learning curve in analysis?
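For reference, here is a minimal sketch of the setup described above. The dataset is a synthetic stand-in (`make_classification`) and the parameter grid is hypothetical, since the question does not show either:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (hypothetical)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 60%:40% train/test split, as described in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Grid search over max_depth on the training set only
# (the question reports max_depth=7 as the winner on its data)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": range(1, 11)}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # held-out accuracy
```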

Link to my learning curve image

Thanks a lot!


2 Answers

4
votes

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

A learning curve is useful for many purposes, including comparing different algorithms, choosing model parameters during design, adjusting optimization to improve convergence, and determining how much data is needed for training.

You are not making good use of the learning curve tool, because you are starting with a very large training size, which does not let you see the model's behavior well.

Here is an example that produces one figure where the analysis starts with a small training size and another that starts with a very large training size (your case). To do this, you just have to vary the train_sizes parameter of sklearn.model_selection.learning_curve.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from get_csv_data import HandleData  # user-specific data loader
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, X, y, ax=None, ylim=(0.5, 1.01), cv=None,
                        n_jobs=4, train_sizes=np.linspace(.1, 1.0, 5)):

    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Plot learning curve
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    ax.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    ax.legend(loc="best")

    return ax

fig, (ax1, ax2) = plt.subplots(1, 2)

# get the data
data = HandleData(oneHotFlag=False)
X, y = data.get_synthatic_data()

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC()
plot_learning_curve(estimator, X, y, ax = ax1, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))
plot_learning_curve(estimator, X, y, ax = ax2, cv=cv, train_sizes=np.linspace(.5, 1.0, 5))

plt.show()

output:

0
votes

Your graph shows accuracy as a function of the number of training examples, i.e. how the model's score changes as it is trained on progressively more data.

Training accuracy is the accuracy score when the trained model is evaluated on the same data it was trained on. Essentially, it is tested on data it has already seen.

In cross-validation, the data is randomly split into training and testing sets. The model is trained on the training set and tested on the testing set. The accuracy score reflects how well the testing set is predicted.
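The random splitting can be made concrete with ShuffleSplit (the same splitter used in the other answer's code); this toy example just prints the disjoint train/test index sets it produces:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten tiny "samples", just to show the index splitting
X = np.arange(20).reshape(10, 2)
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

splits = list(cv.split(X))
for train_idx, test_idx in splits:
    # Each iteration: a fresh random 80%/20% partition of the rows
    print(train_idx, test_idx)
```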

The lines coincide because the model is likely well trained: it is just as good at predicting data it has never seen as it is on data it was trained on.