3
votes

I implemented a model using a gradient boosting decision tree as the classifier, and I plotted learning curves for both the training and test sets to decide what to do next to improve the model. The result is shown in the image below:

[Learning curve plot: the y-axis is accuracy (percentage of correct predictions), the x-axis is the number of samples used to train the model.]

I understand that the gap between the training and test scores is probably due to high variance (overfitting). But the plot also shows that the test score (the green line) increases very little as the number of samples grows from 2000 to 3000; the test-score curve is flattening out. The model is not getting better even with more samples.

My understanding is that a flat learning curve usually indicates high bias (underfitting). Is it possible that both underfitting and overfitting are happening in this model? Or is there another explanation for the flat curve?

Any help would be appreciated. Thanks in advance.

=====================================

The code I use is as follows. Basically, it is the same code as the example in the sklearn documentation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, ShuffleSplit
from sklearn.ensemble import GradientBoostingClassifier


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

title = "Learning Curves (GBDT)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = GradientBoostingClassifier(n_estimators=450)
X, y = features, target  # features and target are already loaded
plot_learning_curve(estimator, title, X, y, ylim=(0.6, 1.01), cv=cv, n_jobs=4)

plt.show()

3 Answers

0
votes

First of all, your training accuracy goes down quite a bit as you add more examples, so this could still be high variance. However, I doubt that this is the only explanation, as the gap seems too big.

A reason for a gap between the training accuracy and the test accuracy could be a different distribution of the training samples and the test samples. However, with cross-validation this should not happen (are you doing k-fold cross-validation, where you re-train the model for each of the k folds?).
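To be concrete, here is a minimal sketch of what I mean by re-training for each fold; X and y are stand-ins for your own feature matrix and target array, and the only hyperparameter carried over is your n_estimators=450 setting:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier

def kfold_accuracy(X, y, n_splits=5):
    # Fit a fresh model on each training fold and score it on the held-out fold,
    # so every score comes from data that particular model has never seen.
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X):
        model = GradientBoostingClassifier(n_estimators=450)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)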

0
votes

I would say you are overfitting. Considering you are using cross validation, the gap between the training and the cross-validation score is probably too big. Without cross validation or random splitting, it could be that your train and test data differ in some way.

There are a couple of ways you could try to mitigate this:

  • Add more data (the training score will probably still go down a little bit more)
  • Reduce the number of estimators, or even better, use early stopping (see the sketch after this list)
  • Increase gamma for pruning (a regularization parameter in XGBoost rather than in sklearn's GradientBoostingClassifier)
  • Use subsampling (by tree, by column...)
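As an illustration of the early-stopping and subsampling points, here is a minimal sketch using sklearn's GradientBoostingClassifier; the parameter values are illustrative, not tuned for your data:

from sklearn.ensemble import GradientBoostingClassifier

estimator = GradientBoostingClassifier(
    n_estimators=450,          # upper bound; early stopping usually halts well before this
    learning_rate=0.05,
    subsample=0.8,             # row subsampling per tree
    max_features=0.8,          # column subsampling per split
    validation_fraction=0.2,   # internal hold-out used for early stopping
    n_iter_no_change=10,       # stop when the validation score stops improving
    random_state=0,
)
# estimator.fit(features, target)
# print(estimator.n_estimators_)  # number of trees actually fitted

Note that validation_fraction and n_iter_no_change require scikit-learn 0.20 or later.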

There are lots of parameters that you can play with, so have some fun! :-D

-1
votes

You should pay more attention to your training accuracy. If it goes down during training, you did something terribly wrong. Check the correctness of your data again (are your labels correct?) and check your model.

Normally, both train and test accuracies should increase, with the test accuracy lagging behind.