
While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.

I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.

Note: I'm aware it's probably a bad choice, but I wanted to start with linear regression to see later how other models fit better.

I trained a linear regression and plotted the following learning curve:

[Learning curve plot: the training error climbs steeply as examples are added and sits above the cross-validation error]

This result is particularly surprising to me, so I have some questions about it:

  • Is this curve even remotely possible or is my code necessarily flawed?
  • If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
  • If it is not, any hint to where I made a mistake?

Here's my code (Octave / Matlab) just in case:

Plot:

lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
    learningCurve([ones(mTrain, 1) X_train], y_train, ...
                  [ones(size(X_val, 1), 1) X_val], y_val, ...
                  lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')

Learning curve:

error_train = [];
error_val = [];
S = ['Reg with '];
for i = startPoint:stepSize:m
    temp_X = X(1:i,:);
    temp_y = y(1:i);
    % Initialize Theta
    initial_theta = zeros(size(X, 2), 1); 
    % Create "short hand" for the cost function to be minimized
    costFunction = @(t) linearRegCostFunction(X, y, t, lambda);
    % Now, costFunction is a function that takes in only one argument
    options = optimset('MaxIter', 50, 'GradObj', 'on');
    % Minimize using fmincg
    theta = fmincg(costFunction, initial_theta, options);
    [J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
    error_train = [error_train; [i J]];
    [J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
    error_val = [error_val; [i J]];
    fprintf('%s %6i examples \r', S, i);
    fflush(stdout);
end

Edit: if I shuffle the whole dataset before splitting train/validation and plotting the learning curve, I get very different results, like the three below:

[Learning curve after shuffle 1]

[Learning curve after shuffle 2]

[Learning curve after shuffle 3]

Note: the training set size is always around 24k examples, and the validation set around 8k examples.

It is extremely flawed. Your error should decrease, not increase. – lejlot
Thanks @lejlot for the feedback. Any idea what could have gone wrong? My cost function linearRegCostFunction passed the Coursera validation, so it is probably not the cause... – rom_j
First, what is lambda? Second, you should compute the train error on the whole X (you use the whole set for training, so this is your train error). – lejlot

1 Answer


Is this curve even remotely possible or is my code necessarily flawed?

It's possible, but not very likely. You might be picking the hard-to-predict instances for the training set and the easy ones for the test set every time. Make sure you shuffle your data, and use 10-fold cross-validation.

Even if you do all this, it can still happen, without necessarily indicating a problem in the methodology or the implementation.

If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?

Let's assume your data can only be properly fitted by a 3rd-degree polynomial, and you're using linear regression. Then the more data you add, the more obvious it becomes that your model is inadequate (higher training error). Meanwhile, if you pick only a few instances for the test set, the error there can be smaller, because "linear vs. 3rd degree" may not show a big difference on a handful of test instances for this particular problem.
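To see this effect in isolation, here's a small Octave sketch with made-up data: the true relationship is cubic, and a straight line is fitted to a growing prefix of the points. With few points the line tracks them closely; over the full range it cannot follow the curve, so the average training error grows.

```matlab
% Made-up data: a truly cubic relationship fitted with a line.
x_all = linspace(-3, 3, 200)';
y_all = x_all .^ 3;
for n = [5 50 200]
    x = x_all(1:n);  y = y_all(1:n);
    p = polyfit(x, y, 1);                    % degree-1 (linear) fit
    J = mean((polyval(p, x) - y) .^ 2) / 2;  % squared-error training cost
    fprintf('n = %3d  train error = %.3f\n', n, J);
end
```

The printed training error rises with n: the few leftmost points are nearly collinear, while the full cubic is not.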

For example, if you do regression on 2D points and fit a line to just 2 of them, the line passes through both exactly, so the error on those 2 points is always 0. An extreme example, but you get the idea.
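The two-point case is one line of Octave to verify (the two points here are arbitrary):

```matlab
% A degree-1 polynomial through exactly 2 points fits them perfectly.
x = [1; 4];
y = [2.0; 9.5];                           % any two points
p = polyfit(x, y, 1);                     % linear fit
err = mean((polyval(p, x) - y) .^ 2) / 2; % squared-error cost
fprintf('error = %g\n', err);             % essentially zero, up to floating point
```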

How big is your test set?

Also, make sure your test set remains constant throughout the plotting of the learning curves. Only the training set should grow.

If it is not, any hint to where I made a mistake?

Your test set might not be large enough, or your train and test sets might not be properly randomized. You should shuffle the data and use 10-fold cross-validation.
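As a sketch of that suggestion, here is one way to average each point of the learning curve over 10 random shuffles (repeated random splits, in the spirit of 10-fold cross-validation). It reuses `X`, `y`, `lambda`, and `linearRegCostFunction` from the question; `trainLinearReg` is an assumed training helper (fit theta on the given split), not code from the question.

```matlab
% Sketch, not the original code: average each learning-curve point
% over K = 10 random train/validation splits.
K = 10;
m = size(X, 1);
sizes = 5000:500:floor(0.75 * m);   % keep at least 25% for validation
err_train = zeros(numel(sizes), 1);
err_val   = zeros(numel(sizes), 1);
for s = 1:numel(sizes)
    for k = 1:K
        perm = randperm(m);               % fresh shuffle for every split
        tr = perm(1:sizes(s));            % training indices
        va = perm(sizes(s)+1:end);        % validation indices
        theta = trainLinearReg(X(tr,:), y(tr), lambda);   % assumed helper
        err_train(s) = err_train(s) + linearRegCostFunction(X(tr,:), y(tr), theta, 0) / K;
        err_val(s)   = err_val(s)   + linearRegCostFunction(X(va,:), y(va), theta, 0) / K;
    end
end
plot(sizes, err_train, sizes, err_val)
legend('Train (avg of 10 shuffles)', 'Validation (avg of 10 shuffles)')
xlabel('Number of training examples'), ylabel('Error')
```

Averaging over shuffles smooths out the "unlucky split" effect that can make a single curve look pathological.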

You might also want to look for other research on that data set. What results are other people getting?

Regarding the update

That makes a bit more sense, I think. The test error is generally higher now. However, those errors look huge to me; probably the most important thing this tells you is that linear regression fits this data very poorly.

Once more, I suggest you use 10-fold cross-validation for the learning curves. Think of it as averaging all of your current plots into one. And shuffle the data before running the process.