The machine learning literature strongly recommends normalizing the data for SVMs (see "Preprocessing data" in the scikit-learn docs). And as answered before, the same StandardScaler should be applied to both the training and the test data.
- What are the advantages of using StandardScaler over manually subtracting the mean and dividing by the standard deviation (other than the ability to use it in a pipeline)? See the sketch after the code below.
- LinearSVC in scikit-learn uses one-vs-rest for multiple classes (as larsmans mentioned, SVC uses one-vs-one for multi-class). So what happens if I train multiple classes in a pipeline with normalization as the first estimator? Would it also calculate a separate mean and standard deviation for each class, and use them during classification?
- To be more specific, does the following classifier apply a different mean and standard deviation to each class before the SVM stage of the pipeline?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Note: class_weight='auto' was renamed to 'balanced' in later scikit-learn releases
estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight='auto'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)
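Regarding the first and third bullets, here is a minimal sketch (toy data, arbitrary values; attribute names assume a recent scikit-learn, where scale_ replaced the older std_). It shows that StandardScaler computes exactly the manual (x - mean) / std transformation, using the population standard deviation (np.std with ddof=0), and that a pipeline fits the scaler once on the whole training set, storing one statistic per feature, never per class:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy training set: 4 samples, 2 features, 2 classes (values are arbitrary).
X_train = np.array([[1.0, 200.0],
                    [2.0, 300.0],
                    [3.0, 400.0],
                    [4.0, 500.0]])
y = np.array([0, 0, 1, 1])

# Manual standardization: per-feature mean/std over ALL training rows.
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# StandardScaler performs the same computation.
assert np.allclose(StandardScaler().fit_transform(X_train), X_manual)

# Inside a pipeline, the scaler is fitted once, on the whole training set:
clf = Pipeline([('normalize', StandardScaler()), ('svm', SVC())])
clf.fit(X_train, y)
scaler = clf.named_steps['normalize']
print(scaler.mean_)   # one mean per FEATURE -> [2.5, 350.0]
print(scaler.scale_)  # one std per FEATURE; nothing is stored per class

Whatever statistics the scaler learned during fit are then applied unchanged to X_test inside clf.predict.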
SVC does not do OvR training for the multiclass case; that's LinearSVC you're referring to. SVC's behavior is controlled by the multiclass parameter, which defaults to OvO (one-vs.-one). – Fred Foo
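A small sketch of the OvO/OvR distinction the comment describes, assuming a recent scikit-learn where it is exposed through SVC's decision_function_shape parameter; the number of decision-function columns reveals how many binary machines were trained:

import numpy as np
from sklearn.svm import SVC, LinearSVC

# Four-class toy problem (values are arbitrary).
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.repeat([0, 1, 2, 3], 10)

# SVC trains one-vs-one: n_classes * (n_classes - 1) / 2 binary machines.
ovo = SVC(decision_function_shape='ovo').fit(X, y)
print(ovo.decision_function(X).shape)  # (40, 6): 4*3/2 OvO columns

# LinearSVC trains one-vs-rest: one binary machine per class.
ovr = LinearSVC(max_iter=10000).fit(X, y)
print(ovr.decision_function(X).shape)  # (40, 4): one OvR column per class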