The machine learning literature strongly recommends normalizing the data for SVMs (see "Preprocessing data" in the scikit-learn docs). And as answered before, the same StandardScaler should be applied to both the training and the test data.
- What are the advantages of using StandardScaler over manually subtracting the mean and dividing by the standard deviation (other than the ability to use it in a pipeline)? See the sketch after the code below.
- LinearSVC in scikit-learn uses one-vs-rest for multiple classes (as larsmans mentioned, SVC uses one-vs-one for multi-class). So what happens if I train multiple classes in a pipeline with normalization as the first estimator? Would it also calculate a separate mean and standard deviation for each class, and use them during classification?
- To be more specific, does the following classifier apply a different mean and standard deviation to each class before the SVM stage of the pipeline?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Note: class_weight='auto' was renamed to 'balanced' in later scikit-learn releases
estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight='auto'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)
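Regarding the first and third bullets, here is a minimal sketch (toy data, arbitrary values; attribute names assume a recent scikit-learn, where scale_ replaced the older std_). It shows that StandardScaler computes exactly the manual (x - mean) / std transformation, using the population standard deviation (np.std with ddof=0), and that a pipeline fits the scaler once on the whole training set, storing one statistic per feature, never per class:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy training set: 4 samples, 2 features, 2 classes (values are arbitrary).
X_train = np.array([[1.0, 200.0],
                    [2.0, 300.0],
                    [3.0, 400.0],
                    [4.0, 500.0]])
y = np.array([0, 0, 1, 1])

# Manual standardization: per-feature mean/std over ALL training rows.
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# StandardScaler performs the same computation.
assert np.allclose(StandardScaler().fit_transform(X_train), X_manual)

# Inside a pipeline, the scaler is fitted once, on the whole training set:
clf = Pipeline([('normalize', StandardScaler()), ('svm', SVC())])
clf.fit(X_train, y)
scaler = clf.named_steps['normalize']
print(scaler.mean_)   # one mean per FEATURE -> [2.5, 350.0]
print(scaler.scale_)  # one std per FEATURE; nothing is stored per class

Whatever statistics the scaler learned during fit are then applied unchanged to X_test inside clf.predict.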
SVC does not do OvR training for the multiclass case; that's LinearSVC you're referring to. SVC's behavior is controlled by the multiclass parameter, which defaults to OvO (one-vs.-one). – Fred Foo
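A small sketch of the OvO/OvR distinction the comment describes, assuming a recent scikit-learn where it is exposed through SVC's decision_function_shape parameter; the number of decision-function columns reveals how many binary machines were trained:

import numpy as np
from sklearn.svm import SVC, LinearSVC

# Four-class toy problem (values are arbitrary).
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.repeat([0, 1, 2, 3], 10)

# SVC trains one-vs-one: n_classes * (n_classes - 1) / 2 binary machines.
ovo = SVC(decision_function_shape='ovo').fit(X, y)
print(ovo.decision_function(X).shape)  # (40, 6): 4*3/2 OvO columns

# LinearSVC trains one-vs-rest: one binary machine per class.
ovr = LinearSVC(max_iter=10000).fit(X, y)
print(ovr.decision_function(X).shape)  # (40, 4): one OvR column per class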