3
votes

I implemented a model using a gradient boosting decision tree as the classifier, and I plotted learning curves for both the training and test sets to decide what to do next to improve the model. The result is shown in the image below:

[Learning curve plot: the y-axis is accuracy (percentage of correct predictions), the x-axis is the number of samples used to train the model.]

I understand that the gap between the training and test scores is probably due to high variance (overfitting). But the plot also shows that the test score (the green line) increases very little as the number of samples grows from 2000 to 3000; the test-score curve is flattening out. The model is not getting better even with more samples.

My understanding is that a flat learning curve usually indicates high bias (underfitting). Is it possible that both underfitting and overfitting are happening in this model? Or is there another explanation for the flat curve?

Any help would be appreciated. Thanks in advance.

=====================================

The code I use is as follows. Basically, it is the same code as the example in the sklearn documentation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, ShuffleSplit
from sklearn.ensemble import GradientBoostingClassifier


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

title = "Learning Curves (GBDT)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = GradientBoostingClassifier(n_estimators=450)
X, y = features, target  # features and target are already loaded
plot_learning_curve(estimator, title, X, y, ylim=(0.6, 1.01), cv=cv, n_jobs=4)

plt.show()

3 Answers

0
votes

First of all, your training accuracy goes down quite a bit as you add more examples, so this could still be high variance. However, I doubt that this is the only explanation, as the gap seems too big.

A reason for a gap between the training accuracy and the test accuracy could be a different distribution of the training samples and the test samples. However, with cross-validation this should not happen (are you doing k-fold cross-validation, where you re-train the model for each of the k folds?).
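To be concrete, here is a minimal sketch of what I mean by re-training for each fold; X and y are stand-ins for your own feature matrix and target array, and the only hyperparameter carried over is your n_estimators=450 setting:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier

def kfold_accuracy(X, y, n_splits=5):
    # Fit a fresh model on each training fold and score it on the held-out fold,
    # so every score comes from data that particular model has never seen.
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X):
        model = GradientBoostingClassifier(n_estimators=450)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)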

0
votes

I would say you are overfitting. Considering you are using cross validation, the gap between the training and the cross-validation score is probably too big. Without cross validation or random splitting, it could be that your train and test data differ in some way.

There are a couple of ways you could try to mitigate this:

  • Add more data (the training score will probably still go down a little bit more)
  • Reduce the number of estimators, or even better, use early stopping (see the sketch after this list)
  • Increase gamma for pruning (a regularization parameter in XGBoost rather than in sklearn's GradientBoostingClassifier)
  • Use subsampling (by tree, by column...)
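As an illustration of the early-stopping and subsampling points, here is a minimal sketch using sklearn's GradientBoostingClassifier; the parameter values are illustrative, not tuned for your data:

from sklearn.ensemble import GradientBoostingClassifier

estimator = GradientBoostingClassifier(
    n_estimators=450,          # upper bound; early stopping usually halts well before this
    learning_rate=0.05,
    subsample=0.8,             # row subsampling per tree
    max_features=0.8,          # column subsampling per split
    validation_fraction=0.2,   # internal hold-out used for early stopping
    n_iter_no_change=10,       # stop when the validation score stops improving
    random_state=0,
)
# estimator.fit(features, target)
# print(estimator.n_estimators_)  # number of trees actually fitted

Note that validation_fraction and n_iter_no_change require scikit-learn 0.20 or later.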

There are lots of parameters that you can play with, so have some fun! :-D

-1
votes

You should pay more attention to your training accuracy. If it goes down during training, you did something terribly wrong. Check the correctness of your data again (are your labels correct?) and check your model.

Normally, both train and test accuracies should increase, with the test accuracy lagging behind.