3 votes

scikit-learn's SVM is based on LIBSVM. LIBSVM requires that the data be scaled, and the recommendation is that each feature value lie in one of the two ranges [0, 1] or [-1, 1]. That is, in the typical data matrix, each column is a feature and the scaling is done per column.

The LIBSVM FAQ suggests a simple scaling to map each feature into [0, 1]:

x' = (x - min) / (max - min)
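
For illustration, here is that formula applied per column with NumPy (X is a made-up example matrix, not data from a real problem):

import numpy as np

# Hypothetical data matrix: rows are samples, columns are features
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 600.0]])

# x' = (x - min) / (max - min), computed per column
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))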

Does scikit-learn support this "simple scaling"? Are there other recommendations for scaling features for use with an SVM and the RBF kernel? Any references? I found the article "A Practical Guide to Support Vector Classification", which is based on LIBSVM, and it recommends scaling to [0, 1] or [-1, 1].


2 Answers

5 votes

Yes, this functionality is included. The exact formula you describe will be in the next release as sklearn.preprocessing.MinMaxScaler. For now, sklearn.preprocessing.Scaler (to be renamed StandardScaler in the next release, but the old name will stay around for backward compat) centers and scales features to have mean 0 and variance 1, which should be good enough for passing data to an SVM learner.
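
As a minimal sketch of the current API (X is an arbitrary example matrix):

import numpy as np
from sklearn.preprocessing import Scaler  # renamed StandardScaler in later releases

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 600.0]])

scaler = Scaler()
X_std = scaler.fit_transform(X)  # each column now has mean 0 and variance 1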

Also, sklearn.preprocessing.Normalizer (and the TfidfVectorizer used for text classification) rescales each sample to unit Euclidean norm, so for non-negative data such as tf-idf the values end up in [0, 1]. This is the length normalization common in text classification and information retrieval.
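
A quick sketch of the per-sample behaviour (again with a made-up X):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])  # hypothetical samples, one per row

X_unit = Normalizer().fit_transform(X)  # each row now has Euclidean norm 1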

You can use a Pipeline object to construct a centering, scaling SVM classifier:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler  # renamed StandardScaler in later releases
from sklearn.svm import SVC

# First center/scale the features, then feed them to the SVM
clf = Pipeline([('scale', Scaler()),
                ('svm', SVC())])
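
Fitting the pipeline learns the scaling parameters from the training data and trains the SVM in one step; prediction then applies the same scaling (X_train, y_train and X_test are placeholder names here):

clf.fit(X_train, y_train)     # fits the scaler, then the SVM, on the scaled data
y_pred = clf.predict(X_test)  # scales X_test with the training statistics first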
1 vote

I think you're looking for StandardScaler, at least for the [-1, 1] case.