4 votes

I am performing logistic regression in MATLAB with L2 regularization on text data. My program works well on small datasets, but on larger sets it runs indefinitely.

I have seen the potentially duplicate question (matlab fminunc not quitting (running indefinitely)). In that question, the cost for the initial theta was NaN and an error was printed to the console. In my implementation, I am getting a real-valued cost and no error is produced even with verbose options passed to fminunc(). Hence I believe this question is not a duplicate.

I need help scaling it to larger sets. My current training data is roughly 10k*12k (10k text files cumulatively containing 12k unique words), so I have m = 10k training examples and n = 12k features.
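For perspective, a quick back-of-the-envelope memory estimate, assuming X is stored as a dense double-precision matrix:

bytesNeeded = 10000 * 12000 * 8;   % doubles take 8 bytes each
gigabytes = bytesNeeded / 2^30     % roughly 0.9 GiB for a single dense copy of X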

My cost function is defined as follows:

function [J, gradient] = costFunction(X, y, lambda, theta)

    [m, n] = size(X);
    g = inline('1.0 ./ (1.0 + exp(-z))');   % sigmoid
    h = g(X*theta);                         % predicted probabilities

    % Regularized cross-entropy cost; theta(1) is the intercept and is not penalized
    J = (1/m)*sum(-y.*log(h) - (1-y).*log(1-h)) + (lambda/(2*m))*norm(theta(2:end))^2;

    % Gradient; the regularization term is added, matching the + sign in J above
    gradient = zeros(n, 1);
    gradient(1) = (1/m)*sum((h-y) .* X(:,1));

    for i = 2:n
        gradient(i) = (1/m)*sum((h-y) .* X(:,i)) + (lambda/m)*theta(i);
    end
end
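As a side note, the column-by-column gradient loop above can be collapsed into a single matrix-vector product. A minimal sketch of an equivalent vectorized form (same variables as in costFunction):

gradient = (1/m) * (X' * (h - y));                               % all n column sums at once
gradient(2:end) = gradient(2:end) + (lambda/m) * theta(2:end);   % skip the intercept term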

I am performing optimization using MATLAB's fminunc() function. The parameters I pass to fminunc() are:

options = optimset('LargeScale', 'on', 'GradObj', 'on', 'MaxIter', MAX_ITR);
theta0 = zeros(n, 1);

[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);

I am running this code on a machine with these specifications:

Macbook Pro i7 2.8GHz / 8GB RAM / MATLAB R2011b

The cost function seems to behave correctly. For initial theta, I get acceptable values of J and gradient.

K>> theta0 = zeros(n, 1);
K>> [j g] = costFunction(X, y, lambda, theta0);
K>> j

j =

    0.6931

K>> max(g)

ans =

    0.4082

K>> min(g)

ans =

  -2.7021e-05

The program takes extremely long to run. To profile it, I set MAX_ITR = 1 for fminunc(). Even with a single iteration, the program did not complete after a couple of hours. My questions are:

  1. Am I doing something wrong mathematically?

  2. Should I use any other optimizer instead of fminunc()? With LargeScale=on, fminunc() uses trust-region algorithms.

  3. Is this problem cluster-scale, i.e. should it not be run on a single machine?

Any other general tips will be appreciated. Thanks!




The problem is fairly small, nowhere near cluster-scale. Using a general-purpose solver like fminunc is overkill, though. You are probably better off using another solver. Have you considered other methods (e.g. linear SVM, which is known to perform very well for text classification)? To give you an idea, a small problem like this can be solved in a matter of seconds with linear SVM. – Marc Claesen
Well, profiling/debug mode will certainly slow it down. Did you try setting the 'Display' option to 'iter' using optimset to see what fminunc is doing? On the small datasets where it does work, what is the exitflag describing the exit condition? Also, why do you have an inline equation in your cost function? It could be replaced with an anonymous function (g = @(z)1./(1+exp(-z))) or removed entirely (h = 1./(1+exp(-X*theta))). – horchler
@MarcClaesen Thanks Marc. I wanted to specifically try logistic regression for this problem. You mentioned that it might be better to try another solver. Would you recommend any particular solver for this purpose? – Viraj Kulkarni
@horchler I tried setting Display=iter. On smaller sets when the program works, I get output on every iteration. For larger sets when the program does not work, I get absolutely no output from fminunc(). The exitflag for smaller sets is 0 and the output says: 'Solver stopped prematurely. fminunc stopped because it exceeded the iteration limit, options.MaxIter = 50 (the selected value).' – Viraj Kulkarni
Does the solver eventually converge if you allow more iterations? If not, then there is no point in trying the larger data sets. Have you checked if your cost function is even being evaluated in these large-data cases by printing X (and/or n and J and gradient)? You may have also specified a bad initial guess, X0, in which case the algorithm is forced to do a lot of work before it can get going. – horchler

3 Answers

1 vote

I was able to get this working by setting the LargeScale flag to 'off' in fminunc(). From what I gather, LargeScale = 'on' uses trust-region algorithms, while keeping it 'off' uses quasi-Newton methods. Using quasi-Newton methods and passing the gradient worked a lot faster for this particular problem and gave very nice results.
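For reference, a minimal sketch of the call with that one change (MAX_ITR, lambda, X, y and costFunction as in the question; adding 'Display', 'iter' for progress output is my own suggestion):

options = optimset('LargeScale', 'off', ...   % quasi-Newton instead of trust-region
                   'GradObj', 'on', ...       % costFunction supplies the gradient
                   'Display', 'iter', ...     % print progress at every iteration
                   'MaxIter', MAX_ITR);
theta0 = zeros(n, 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);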

0 votes

Here is my advice:

- Set the MATLAB flag to show debug output during the run. Failing that, print the cost inside your cost function, which will let you monitor the iteration count and the error.

And second, which is very important:

Your problem is ill-posed, or rather underdetermined: you have a 12k-dimensional feature space but only 10k examples, which means that for an unconstrained optimization the answer is -Inf. As a quick example of why, your problem is like: minimize x+y+z given that x+y-z = 2, a feature space of dimension 3 with a spanned vector space of dimension 1. I suggest using PCA or CCA to reduce the dimensionality of the text files while retaining up to 99% of their variation. This will probably give you a feature space of ~100-200 dimensions.
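A minimal sketch of that reduction, assuming the asker's R2011b, where princomp was the Statistics Toolbox PCA routine (newer releases use pca()), and assuming the dense X fits in memory:

[coeff, score, latent] = princomp(X);       % principal components of the centered X
explained = cumsum(latent) / sum(latent);   % cumulative fraction of variance explained
k = find(explained >= 0.99, 1);             % smallest k retaining 99% of the variance
Xred = score(:, 1:k);                       % reduced m-by-k feature matrix

Note that princomp itself can be memory-hungry at 12k features; a truncated SVD (svds) is a cheaper alternative.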

PS: Just to point out that the problem is very far from cluster-size requirements, which usually means 1M+ data points, and that fminunc is not at all overkill. LIBSVM has nothing to do with it, because fminunc is just an optimizer while LIBSVM is a classifier; to be clear, LIBSVM uses something similar to fminunc internally, just with a different objective function.

0 votes

Here's what I suspect to be the issue, based on my experience with this type of problem. You're using a dense representation for X instead of a sparse one. You're also seeing the typical effect in text classification that the number of terms increases roughly linearly with the number of samples. Effectively, the cost of the dense matrix multiplication X*theta grows quadratically with the number of samples.

By contrast, a good sparse matrix representation iterates only over the non-zero elements to do a matrix multiplication. If documents are of roughly constant length, that cost is roughly constant per document, giving a linear rather than quadratic slowdown in the number of samples.

I'm not a MATLAB guru, but I know it has built-in sparse matrix support, so try to use that.
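A minimal sketch of the idea, where docIdx, termIdx and count are hypothetical triplet vectors (document index, term index, term count) built while parsing the corpus:

X = sparse(docIdx, termIdx, count, m, n);   % stores only the non-zero entries
h = 1 ./ (1 + exp(-(X * theta)));           % X*theta now costs O(nnz(X)) rather than O(m*n)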