4
votes

SVM: http://scikit-learn.org/stable/modules/svm.html#classification

SGD: http://scikit-learn.org/stable/modules/sgd.html#classification

Both seem to do pretty much the same thing to my eyes, as the docs say "an SGD implements a linear model". Can someone explain the differences between them?


2 Answers

18
votes

SVM stands for support-vector machine, which is a special kind of linear model. From a theoretical view it's a convex optimization problem, and we can obtain the global optimum in polynomial time. There are many different optimization approaches.

In the past people used general Quadratic Programming solvers. Nowadays specialized approaches like SMO and others are used.

sklearn's specialized SVM optimizers are based on liblinear and libsvm. There are many documents and research papers available if you are interested in the algorithms.

Keep in mind that SVC (libsvm) and LinearSVC (liblinear) make different assumptions with regard to the optimization problem, which results in different performance on the same task (with a linear kernel, LinearSVC is in general much more efficient than SVC, but some tasks can't be tackled by LinearSVC).
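A minimal sketch of that comparison (the dataset and parameters below are just illustrative choices, not from the docs):

    # Fit both linear-SVM variants on the same toy data.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC, LinearSVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # LinearSVC (liblinear): squared hinge loss by default, penalizes the intercept
    lin = LinearSVC(C=1.0).fit(X, y)

    # SVC with a linear kernel (libsvm): standard hinge loss, no intercept penalty
    svc = SVC(kernel="linear", C=1.0).fit(X, y)

    print(lin.score(X, y), svc.score(X, y))  # similar, but usually not identical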

SGD is a Stochastic Gradient Descent-based optimizer (this is a general optimization method!) which can optimize many different convex optimization problems (actually, it is more or less the same method used in all those Deep Learning approaches, so people use it in the non-convex setting too, throwing away the theoretical guarantees).

sklearn says: "Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions." Nowadays it's actually even more versatile, but here it's enough to note that it subsumes (some) SVMs, logistic regression and others.
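A rough sketch of what "subsumes" means in code: the loss parameter of SGDClassifier selects the model family (data here is just illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    svm_like = SGDClassifier(loss="hinge").fit(X, y)        # linear-SVM objective
    logreg_like = SGDClassifier(loss="log_loss").fit(X, y)  # logistic-regression objective ("log" in older sklearn)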

Now SGD-based optimization is very different from QP and the others. With QP, for example, there are no hyper-parameters to tune. This is a bit simplified, as there can be tuning, but it's not needed to guarantee convergence and performance! (The theory behind QP solvers, e.g. interior-point methods, is much more robust.)

SGD-based optimizers (and first-order methods in general) are very, very hard to tune, and they do need tuning! Learning rates, or learning-rate schedules in general, are the parameters to look at, as convergence depends on them (in theory and in practice)!
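To illustrate the kind of tuning this typically means, here is a sketch searching over the regularization strength and an explicit learning-rate schedule (the grid values are arbitrary examples):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=2000, random_state=0)

    param_grid = {
        "alpha": [1e-5, 1e-4, 1e-3],
        "learning_rate": ["optimal", "invscaling", "adaptive"],
        "eta0": [0.01, 0.1],  # initial step size (used by the invscaling/adaptive schedules)
    }
    search = GridSearchCV(SGDClassifier(loss="hinge", max_iter=2000), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)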

It's a very complex topic, but some simplified rules:

  • Specialized SVM-methods

    • scale worse with the number of samples
    • do not need hyper-parameter tuning
  • SGD-based methods

    • scale better for huge-data in general
    • need hyper-parameter tuning
    • solve only a subset of the tasks approachable by the above (no kernel methods!)

My opinion: use (the easier to use) LinearSVC as long as it's working, given your time-budget!

Just to make it clear: I highly recommend grabbing some dataset (e.g. from within sklearn) and doing some comparisons between those candidates. The need for parameter tuning is not just a theoretical problem! You will quite easily see non-optimal (objective / loss) results in the SGD case!
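A quick comparison along those lines, using a built-in dataset (hyper-parameters are deliberately left at their defaults to make the point that untuned SGD can lag behind the specialized solvers):

    from sklearn.datasets import load_digits
    from sklearn.svm import LinearSVC, SVC
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)

    for name, clf in [("LinearSVC", LinearSVC(max_iter=5000)),
                      ("SVC", SVC(kernel="linear")),
                      ("SGDClassifier", SGDClassifier(loss="hinge"))]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(name, scores.mean())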

And always remember: Stochastic Gradient Descent is sensitive to feature scaling (see the docs). This is more or less a consequence of using first-order methods.
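A sketch of the scaling advice from the docs: put a StandardScaler in front of the SGD-based estimator so every feature has zero mean and unit variance (X_train / X_test below are placeholders for your own data):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import SGDClassifier

    clf = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge"))
    # clf.fit(X_train, y_train); clf.predict(X_test)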

1
vote

SVC (SVM) uses kernel-based optimisation, where the input data is implicitly transformed into a higher-dimensional feature space, which makes it possible to identify more complex boundaries between classes. SVC can perform linear and non-linear classification.

SVC can perform linear classification by setting the kernel parameter to 'linear': svc = SVC(kernel='linear')

SVC can perform non-linear classification by setting the kernel parameter to 'poly' or 'rbf' (the default): svc = SVC(kernel='poly') or svc = SVC(kernel='rbf')
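A small sketch of the three kernel settings mentioned above, fit on a toy non-linear problem (two interleaving half moons; the dataset is only illustrative):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

    for kernel in ("linear", "poly", "rbf"):
        svc = SVC(kernel=kernel).fit(X, y)
        print(kernel, svc.score(X, y))  # poly/rbf should separate the moons better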

SGDClassifier uses a (stochastic) gradient descent optimisation technique, where the optimum coefficients are identified by an iterative process. SGDClassifier can perform only linear classification.

SGDClassifier fits a linear SVC (SVM) model when the parameter loss is set to 'hinge', which is the default, i.e. SGDClassifier(loss='hinge')
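A sketch of that equivalence: with the default hinge loss, SGDClassifier optimizes the same kind of linear-SVM objective as LinearSVC, just with a different (stochastic) optimizer, so the scores end up close but not identical (data is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1000, random_state=0)

    sgd = SGDClassifier(loss="hinge").fit(X, y)
    lin = LinearSVC().fit(X, y)
    print(sgd.score(X, y), lin.score(X, y))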