I implemented a model using gradient boosting decision tree as classifier and I plotted learning curves for both training and test sets to decide what to do next in order to improve my model. The result is as the image:
(Y axis is accuracy (percentage of correct prediction) while x axis is the number of samples i use to train the model.)
I understand that the gap between training and testing score is probably due to high variance(overfitting). But the image also shows that the test score (the green line) increases very little while the number of samples grows from 2000 to 3000. The curve of testing score is getting flat. The model is not getting better even with more samples.
My understand is that a flat learning curve usually indicates high bias (underfitting). Is that possible that both underfitting and overfitting are happening in this model? Or is there another explanation for the flat curve?
Any help would be appreciated. Thanks in advance.
=====================================
the code i use is as follows. Basic i use the same code as the example in sklearn document
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
plt.figure()
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
return plt
title = "Learning Curves (GBDT)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = GradientBoostingClassifier(n_estimators=450)
X,y= features, target #features and target are already loaded
plot_learning_curve(estimator, title, X, y, ylim=(0.6, 1.01), cv=cv, n_jobs=4)
plt.show()