4 votes

I am performing logistic regression in MATLAB with L2 regularization on text data. My program works well on small datasets, but on larger sets it runs indefinitely.

I have seen the potentially duplicate question (matlab fminunc not quitting (running indefinitely)). In that question, the cost for the initial theta was NaN and an error was printed to the console. In my implementation, I am getting a real-valued cost and no error is produced even with verbose options passed to fminunc(). Hence I believe this question is not a duplicate.

I need help scaling it to larger sets. My current training data is roughly 10k*12k (10k text files cumulatively containing 12k unique words), so I have m = 10k training examples and n = 12k features.
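For perspective, a quick back-of-the-envelope memory estimate, assuming X is stored as a dense double-precision matrix:

bytesNeeded = 10000 * 12000 * 8;   % doubles take 8 bytes each
gigabytes = bytesNeeded / 2^30     % roughly 0.9 GiB for a single dense copy of X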

My cost function is defined as follows:

function [J, gradient] = costFunction(X, y, lambda, theta)

    [m, n] = size(X);
    g = inline('1.0 ./ (1.0 + exp(-z))');   % sigmoid
    h = g(X*theta);                         % predicted probabilities

    % Regularized cross-entropy cost; theta(1) is the intercept and is not penalized
    J = (1/m)*sum(-y.*log(h) - (1-y).*log(1-h)) + (lambda/(2*m))*norm(theta(2:end))^2;

    % Gradient; the regularization term is added, matching the + sign in J above
    gradient = zeros(n, 1);
    gradient(1) = (1/m)*sum((h-y) .* X(:,1));

    for i = 2:n
        gradient(i) = (1/m)*sum((h-y) .* X(:,i)) + (lambda/m)*theta(i);
    end
end
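As a side note, the column-by-column gradient loop above can be collapsed into a single matrix-vector product. A minimal sketch of an equivalent vectorized form (same variables as in costFunction):

gradient = (1/m) * (X' * (h - y));                               % all n column sums at once
gradient(2:end) = gradient(2:end) + (lambda/m) * theta(2:end);   % skip the intercept term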

I am performing optimization using MATLAB's fminunc() function. The parameters I pass to fminunc() are:

options = optimset('LargeScale', 'on', 'GradObj', 'on', 'MaxIter', MAX_ITR);
theta0 = zeros(n, 1);

[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);

I am running this code on a machine with these specifications:

Macbook Pro i7 2.8GHz / 8GB RAM / MATLAB R2011b

The cost function seems to behave correctly. For initial theta, I get acceptable values of J and gradient.

K>> theta0 = zeros(n, 1);
K>> [j g] = costFunction(X, y, lambda, theta0);
K>> j

j =

    0.6931

K>> max(g)

ans =

    0.4082

K>> min(g)

ans =

  -2.7021e-05

The program takes extremely long to run. To profile it, I set MAX_ITR = 1 for fminunc(). Even with a single iteration, the program did not complete after a couple of hours. My questions are:

  1. Am I doing something wrong mathematically?

  2. Should I use any other optimizer instead of fminunc()? With LargeScale=on, fminunc() uses trust-region algorithms.

  3. Is this problem cluster-scale, i.e. should it not be run on a single machine?

Any other general tips will be appreciated. Thanks!




The problem is fairly small, nowhere near cluster-scale. Using a general-purpose solver like fminunc is overkill, though. You are probably better off using another solver. Have you considered other methods (e.g. linear SVM, which is known to perform very well for text classification)? To give you an idea, a small problem like this can be solved in a matter of seconds with linear SVM. – Marc Claesen
Well, profiling/debug mode will certainly slow it down. Did you try setting the 'Display' option to 'iter' using optimset to see what fminunc is doing? On the small datasets where it does work, what is the exitflag describing the exit condition? Also, why do you have an inline equation in your cost function? It could be replaced with an anonymous function (g = @(z)1./(1+exp(-z))) or removed entirely (h = 1./(1+exp(-X*theta))). – horchler
@MarcClaesen Thanks Marc. I wanted to specifically try logistic regression for this problem. You mentioned that it might be better to try another solver. Would you recommend any particular solver for this purpose? – Viraj Kulkarni
@horchler I tried setting Display=iter. On smaller sets when the program works, I get output on every iteration. For larger sets when the program does not work, I get absolutely no output from fminunc(). The exitflag for smaller sets is 0 and the output says: 'Solver stopped prematurely. fminunc stopped because it exceeded the iteration limit, options.MaxIter = 50 (the selected value).' – Viraj Kulkarni
Does the solver eventually converge if you allow more iterations? If not, then there is no point in trying the larger data sets. Have you checked if your cost function is even being evaluated in these large-data cases by printing X (and/or n and J and gradient)? You may have also specified a bad initial guess, X0, in which case the algorithm is forced to do a lot of work before it can get going. – horchler

3 Answers

1 vote

I was able to get this working by setting the LargeScale flag to 'off' in fminunc(). From what I gather, LargeScale = 'on' uses trust-region algorithms, while keeping it 'off' uses quasi-Newton methods. Using quasi-Newton methods and passing the gradient worked a lot faster for this particular problem and gave very nice results.
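For reference, a minimal sketch of the call with that one change (MAX_ITR, lambda, X, y and costFunction as in the question; adding 'Display', 'iter' for progress output is my own suggestion):

options = optimset('LargeScale', 'off', ...   % quasi-Newton instead of trust-region
                   'GradObj', 'on', ...       % costFunction supplies the gradient
                   'Display', 'iter', ...     % print progress at every iteration
                   'MaxIter', MAX_ITR);
theta0 = zeros(n, 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);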

0 votes

Here is my advice:

- Set the MATLAB flag to show debug output during the run. Failing that, print the cost inside your cost function, which will let you monitor the iteration count and the error.

And second, which is very important:

Your problem is ill-posed, or rather underdetermined: you have a 12k-dimensional feature space but only 10k examples, which means that for an unconstrained optimization the answer is -Inf. As a quick example of why, your problem is like: minimize x+y+z given that x+y-z = 2, a feature space of dimension 3 with a spanned vector space of dimension 1. I suggest using PCA or CCA to reduce the dimensionality of the text files while retaining up to 99% of their variation. This will probably give you a feature space of ~100-200 dimensions.
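A minimal sketch of that reduction, assuming the asker's R2011b, where princomp was the Statistics Toolbox PCA routine (newer releases use pca()), and assuming the dense X fits in memory:

[coeff, score, latent] = princomp(X);       % principal components of the centered X
explained = cumsum(latent) / sum(latent);   % cumulative fraction of variance explained
k = find(explained >= 0.99, 1);             % smallest k retaining 99% of the variance
Xred = score(:, 1:k);                       % reduced m-by-k feature matrix

Note that princomp itself can be memory-hungry at 12k features; a truncated SVD (svds) is a cheaper alternative.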

PS: Just to point out that the problem is very far from cluster-size requirements, which usually means 1M+ data points, and that fminunc is not at all overkill. LIBSVM has nothing to do with it, because fminunc is just an optimizer while LIBSVM is a classifier; to be clear, LIBSVM uses something similar to fminunc internally, just with a different objective function.

0 votes

Here's what I suspect to be the issue, based on my experience with this type of problem. You're using a dense representation for X instead of a sparse one. You're also seeing the typical effect in text classification that the number of terms increases roughly linearly with the number of samples. Effectively, the cost of the dense matrix multiplication X*theta grows quadratically with the number of samples.

By contrast, a good sparse matrix representation iterates only over the non-zero elements to do a matrix multiplication. If documents are of roughly constant length, that cost is roughly constant per document, giving a linear rather than quadratic slowdown in the number of samples.

I'm not a MATLAB guru, but I know it has built-in sparse matrix support, so try to use that.
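A minimal sketch of the idea, where docIdx, termIdx and count are hypothetical triplet vectors (document index, term index, term count) built while parsing the corpus:

X = sparse(docIdx, termIdx, count, m, n);   % stores only the non-zero entries
h = 1 ./ (1 + exp(-(X * theta)));           % X*theta now costs O(nnz(X)) rather than O(m*n)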