1
votes

I used a classification decision tree in my analysis. First, I split the whole dataset into training and testing sets (60%:40%). Then I ran GridSearch on my training set to find the best-scoring model (max_depth=7). Then I plotted the learning curve over the cross-validation and training sets. Here is the graph I got. The two lines seem to overlap. So what does that tell me? That there is no overfitting in my model? And in general, why do we need the learning curve in analysis?
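For reference, here is a minimal sketch of the setup described above. The dataset is a synthetic stand-in (`make_classification`) and the parameter grid is hypothetical, since the question does not show either:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (hypothetical)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 60%:40% train/test split, as described in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Grid search over max_depth on the training set only
# (the question reports max_depth=7 as the winner on its data)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": range(1, 11)}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # held-out accuracy
```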

Link to my learning curve image

Thanks a lot!


2 Answers

4
votes

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

A learning curve is useful for many purposes, including comparing different algorithms, choosing model parameters during design, adjusting optimization to improve convergence, and determining how much data is needed for training.

You are not making good use of the learning curve tool, because you are starting with a very large training size, which does not let you see the model's behavior well.

Here is an example that produces one figure where the analysis starts with a small training size and another that starts with a very large training size (your case). To do this, you just have to vary the train_sizes parameter of sklearn.model_selection.learning_curve.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from get_csv_data import HandleData  # user-specific data loader
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, X, y, ax=None, ylim=(0.5, 1.01), cv=None,
                        n_jobs=4, train_sizes=np.linspace(.1, 1.0, 5)):

    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Plot learning curve
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    ax.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    ax.legend(loc="best")

    return ax

fig, (ax1, ax2) = plt.subplots(1, 2)

# get the data
data = HandleData(oneHotFlag=False)
X, y = data.get_synthatic_data()

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC()
plot_learning_curve(estimator, X, y, ax = ax1, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))
plot_learning_curve(estimator, X, y, ax = ax2, cv=cv, train_sizes=np.linspace(.5, 1.0, 5))

plt.show()

output:

0
votes

Your graph shows accuracy as a function of the number of training examples, i.e. how the model's score changes as it is trained on progressively more data.

Training accuracy is the accuracy score when the trained model is evaluated on the same data it was trained on. Essentially, it is tested on data it has already seen.

In cross-validation, the data is randomly split into training and testing sets. The model is trained on the training set and tested on the testing set. The accuracy score reflects how well the testing set is predicted.
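The random splitting can be made concrete with ShuffleSplit (the same splitter used in the other answer's code); this toy example just prints the disjoint train/test index sets it produces:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten tiny "samples", just to show the index splitting
X = np.arange(20).reshape(10, 2)
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

splits = list(cv.split(X))
for train_idx, test_idx in splits:
    # Each iteration: a fresh random 80%/20% partition of the rows
    print(train_idx, test_idx)
```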

The lines coincide because the model is likely well trained: it is just as good at predicting data it has never seen as it is on data it was trained on.