2
votes

I'm trying to fit a regression model with an L1 penalty, but I'm having trouble finding an implementation in Python that fits in a reasonable amount of time. The data I've got is on the order of 100k by 500 (side note: several of the variables are pretty correlated), but running the sklearn Lasso implementation on it takes upwards of 12 hours to fit a single model (I'm not actually sure of the exact time; I've left it running overnight several times and it never finished).

I've been looking into Stochastic Gradient Descent as a way to get the job done faster. However, the SGDRegressor implementation in sklearn takes on the order of 8 hours to fit when I'm using 1e5 iterations. This seems like a relatively small number of iterations (the docs even suggest that the model often takes around 1e6 iterations to converge).

I'm wondering if there's something I'm being stupid about that's causing the fits to take a really long time. I've been told that SGD is often used for its efficiency (something around O(n_iter * n_samp * n_feat)), though so far I haven't seen much improvement over Lasso.
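For reference, here's roughly what I'm comparing, sketched on small synthetic data rather than my real matrix (parameter names like `max_iter` are from current sklearn versions, which may differ from older releases):

```python
# Sketch of the two approaches on synthetic data (not my actual dataset).
import numpy as np
from sklearn.linear_model import Lasso, SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)                      # stand-in for the real 100k x 500 matrix
y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(1000)

# Coordinate-descent Lasso: exact, but can crawl on large, correlated data.
lasso = Lasso(alpha=0.1).fit(X, y)

# SGD with an L1 penalty: each iteration is a pass over the data, so the
# cost is roughly O(n_iter * n_samples * n_features) as mentioned above.
sgd = SGDRegressor(penalty="l1", alpha=0.1, max_iter=20,
                   tol=1e-4, random_state=0).fit(X, y)

print(lasso.coef_.shape, sgd.coef_.shape)
```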

To speed things up, I have tried:

  1. Decreasing n_iter, but this often leads to a pretty bad solution because it hasn't converged yet.
  2. Increasing the step size (and decreasing n_iter), but this often makes the loss function explode.
  3. Changing the learning rate schedule (from inverse scaling to one based on the number of iterations), but this also didn't seem to make a huge difference.

Any suggestions for speeding this process up? It seems like partial_fit might be part of the answer, though the docs on this are somewhat sparse. I'd love to be able to fit these models without waiting for three days apiece.
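For reference, here's the kind of partial_fit loop I had in mind (a sketch on synthetic data; the mini-batch size and other settings are just placeholders):

```python
# Sketch of a partial_fit mini-batch loop: one call per batch, with the
# model's state carried over between calls.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(2000, 50)
y = X[:, 0] + 0.1 * rng.randn(2000)

model = SGDRegressor(penalty="l1", alpha=0.01, random_state=0)
batch_size = 200                            # placeholder batch size
for epoch in range(5):                      # a few passes over the data
    for start in range(0, len(X), batch_size):
        sl = slice(start, start + batch_size)
        model.partial_fit(X[sl], y[sl])     # updates the weights in place

print(model.coef_.shape)
```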

1
There's an interesting paper on Stochastic Gradient Descent Tricks by Microsoft Research that may help. – miraculixx
Ah, that is quite useful. I also looked through the sklearn SGD code, and it looks like they actually make a pass through the whole dataset on each iteration, so it's actually much faster to converge than I thought. – choldgraf

1 Answer

3
votes

Partial_fit is not the answer. It will not speed anything up. If anything, it would make it slower.

The implementation is pretty efficient, and I am surprised that you say convergence is slow. You do way too many iterations, I think. Have you looked at how the objective decreases?

Often tuning the initial learning rate can give speedups. Your dataset really shouldn't be a problem. I'm not sure if SGDRegressor does that internally, but rescaling your target to unit variance might help.
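For example, a sketch of both suggestions on synthetic data (parameter names such as `eta0` and `max_iter` are from current sklearn versions):

```python
# Sketch: rescale the target to unit variance, then do a coarse grid
# search over the initial learning rate eta0.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)
y = 50.0 * X[:, 0] + rng.randn(1000)        # target with large variance

# Rescale the target to zero mean and unit variance.
y_scaled = (y - y.mean()) / y.std()

results = {}
for eta0 in (0.001, 0.01, 0.1):             # coarse grid over the initial rate
    model = SGDRegressor(penalty="l1", alpha=0.01,
                         learning_rate="invscaling", eta0=eta0,
                         max_iter=50, tol=1e-4, random_state=0)
    model.fit(X, y_scaled)
    results[eta0] = np.mean((model.predict(X) - y_scaled) ** 2)

for eta0, mse in results.items():
    print(eta0, round(mse, 4))
```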

You could try Vowpal Wabbit, which is an even faster implementation, but it shouldn't be necessary.