4
votes

SVM: http://scikit-learn.org/stable/modules/svm.html#classification

SGD: http://scikit-learn.org/stable/modules/sgd.html#classification

Both seem to do pretty much the same thing to my eyes, as the docs say "an SGD implements a linear model". Can someone explain the differences between them?


2 Answers

18
votes

SVM stands for support-vector machine, which is a special kind of linear model. From a theoretical view it's a convex optimization problem, and we can obtain the global optimum in polynomial time. There are many different optimization approaches.

In the past people used general Quadratic Programming solvers. Nowadays specialized approaches like SMO and others are used.

sklearn's specialized SVM optimizers are based on liblinear and libsvm. There are many documents and research papers available if you are interested in the algorithms.

Keep in mind that SVC (libsvm) and LinearSVC (liblinear) make different assumptions with regard to the optimization problem, which results in different performance on the same task (with a linear kernel, LinearSVC is in general much more efficient than SVC, but some tasks can't be tackled by LinearSVC).
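A minimal sketch of that comparison (the dataset and parameters below are just illustrative choices, not from the docs):

    # Fit both linear-SVM variants on the same toy data.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC, LinearSVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # LinearSVC (liblinear): squared hinge loss by default, penalizes the intercept
    lin = LinearSVC(C=1.0).fit(X, y)

    # SVC with a linear kernel (libsvm): standard hinge loss, no intercept penalty
    svc = SVC(kernel="linear", C=1.0).fit(X, y)

    print(lin.score(X, y), svc.score(X, y))  # similar, but usually not identical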

SGD is a Stochastic Gradient Descent-based optimizer (this is a general optimization method!) which can optimize many different convex optimization problems (actually, it is more or less the same method used in all those Deep Learning approaches, so people use it in the non-convex setting too, throwing away the theoretical guarantees).

sklearn says: "Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions." Nowadays it's actually even more versatile, but here it's enough to note that it subsumes (some) SVMs, logistic regression and others.
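A rough sketch of what "subsumes" means in code: the loss parameter of SGDClassifier selects the model family (data here is just illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    svm_like = SGDClassifier(loss="hinge").fit(X, y)        # linear-SVM objective
    logreg_like = SGDClassifier(loss="log_loss").fit(X, y)  # logistic-regression objective ("log" in older sklearn)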

Now SGD-based optimization is very different from QP and the others. With QP, for example, there are no hyper-parameters to tune. This is a bit simplified, as there can be tuning, but it's not needed to guarantee convergence and performance! (The theory behind QP solvers, e.g. interior-point methods, is much more robust.)

SGD-based optimizers (and first-order methods in general) are very, very hard to tune, and they do need tuning! Learning rates, or learning-rate schedules in general, are the parameters to look at, as convergence depends on them (in theory and in practice)!
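To illustrate the kind of tuning this typically means, here is a sketch searching over the regularization strength and an explicit learning-rate schedule (the grid values are arbitrary examples):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=2000, random_state=0)

    param_grid = {
        "alpha": [1e-5, 1e-4, 1e-3],
        "learning_rate": ["optimal", "invscaling", "adaptive"],
        "eta0": [0.01, 0.1],  # initial step size (used by the invscaling/adaptive schedules)
    }
    search = GridSearchCV(SGDClassifier(loss="hinge", max_iter=2000), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)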

It's a very complex topic, but some simplified rules:

  • Specialized SVM-methods

    • scale worse with the number of samples
    • do not need hyper-parameter tuning
  • SGD-based methods

    • scale better for huge-data in general
    • need hyper-parameter tuning
    • solve only a subset of the tasks approachable by the above (no kernel methods!)

My opinion: use (the easier to use) LinearSVC as long as it's working, given your time-budget!

Just to make it clear: I highly recommend grabbing some dataset (e.g. from within sklearn) and doing some comparisons between those candidates. The need for parameter tuning is not just a theoretical problem! You will quite easily see non-optimal (objective / loss) results in the SGD case!
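A quick comparison along those lines, using a built-in dataset (hyper-parameters are deliberately left at their defaults to make the point that untuned SGD can lag behind the specialized solvers):

    from sklearn.datasets import load_digits
    from sklearn.svm import LinearSVC, SVC
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)

    for name, clf in [("LinearSVC", LinearSVC(max_iter=5000)),
                      ("SVC", SVC(kernel="linear")),
                      ("SGDClassifier", SGDClassifier(loss="hinge"))]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(name, scores.mean())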

And always remember: Stochastic Gradient Descent is sensitive to feature scaling (see the docs). This is more or less a consequence of using first-order methods.
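A sketch of the scaling advice from the docs: put a StandardScaler in front of the SGD-based estimator so every feature has zero mean and unit variance (X_train / X_test below are placeholders for your own data):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import SGDClassifier

    clf = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge"))
    # clf.fit(X_train, y_train); clf.predict(X_test)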

1
vote

SVC (SVM) uses kernel-based optimisation, where the input data is implicitly transformed into a higher-dimensional feature space, which makes it possible to identify more complex boundaries between classes. SVC can perform linear and non-linear classification.

SVC can perform linear classification by setting the kernel parameter to 'linear': svc = SVC(kernel='linear')

SVC can perform non-linear classification by setting the kernel parameter to 'poly' or 'rbf' (the default): svc = SVC(kernel='poly') or svc = SVC(kernel='rbf')
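A small sketch of the three kernel settings mentioned above, fit on a toy non-linear problem (two interleaving half moons; the dataset is only illustrative):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

    for kernel in ("linear", "poly", "rbf"):
        svc = SVC(kernel=kernel).fit(X, y)
        print(kernel, svc.score(X, y))  # poly/rbf should separate the moons better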

SGDClassifier uses a (stochastic) gradient descent optimisation technique, where the optimum coefficients are identified by an iterative process. SGDClassifier can perform only linear classification.

SGDClassifier fits a linear SVC (SVM) model when the parameter loss is set to 'hinge', which is the default, i.e. SGDClassifier(loss='hinge')
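A sketch of that equivalence: with the default hinge loss, SGDClassifier optimizes the same kind of linear-SVM objective as LinearSVC, just with a different (stochastic) optimizer, so the scores end up close but not identical (data is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1000, random_state=0)

    sgd = SGDClassifier(loss="hinge").fit(X, y)
    lin = LinearSVC().fit(X, y)
    print(sgd.score(X, y), lin.score(X, y))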