
I am using StandardScaler to normalize my dataset, that is, I turn each feature into a z-score by subtracting the mean and dividing by the standard deviation.

I would like to use StandardScaler within sklearn's Pipeline, and I am wondering how exactly the transformation is applied to X_test. That is, in the code below, when I run pipeline.predict(X_test), it is my understanding that StandardScaler and SVC() are run on X_test, but what exactly does StandardScaler use as the mean and the standard deviation? The ones from X_train, or does it compute them only for X_test? If, for instance, X_test consisted of only 2 samples, the normalization would look a lot different than if I had normalized X_train and X_test altogether, right?

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

steps = [('scaler', StandardScaler()),
         ('model', SVC())]
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

1 Answer


Sklearn's Pipeline applies transformer.fit_transform() when pipeline.fit() is called and transformer.transform() when pipeline.predict() is called. So in your case, StandardScaler is fitted to X_train, and the mean and standard deviation computed from X_train are then used to scale X_test.
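You can verify this yourself through the pipeline's named_steps attribute. A minimal sketch, using synthetic data (the shapes and distribution parameters here are just assumptions for illustration):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

pipeline = Pipeline([('scaler', StandardScaler()), ('model', SVC())])
pipeline.fit(X_train, y_train)

scaler = pipeline.named_steps['scaler']
# The fitted statistics come from X_train only:
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))   # True
print(np.allclose(scaler.scale_, X_train.std(axis=0)))   # True

# pipeline.predict(X_test) therefore scales X_test as:
X_test_scaled = (X_test - scaler.mean_) / scaler.scale_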

Scaling X_test with statistics from X_train alone would indeed look different from scaling it with statistics computed on X_train and X_test combined. The extent of the difference depends on how much the distributions of X_train and X_test differ. However, if they were randomly partitioned from the same original dataset, and are of reasonable size, their distributions will probably be similar.
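To make the contrast concrete, here is a small hypothetical example with made-up numbers, comparing the two choices of statistics:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0], [6.0]])  # mean 3.0, std ~2.24
X_test = np.array([[10.0], [12.0]])               # mean 11.0, std 1.0

# Statistics from X_train only (what the pipeline does):
train_scaler = StandardScaler().fit(X_train)
print(train_scaler.transform(X_test).ravel())     # ~[3.13, 4.02]

# Statistics from X_train and X_test combined (leaks test information):
combined_scaler = StandardScaler().fit(np.vstack([X_train, X_test]))
print(combined_scaler.transform(X_test).ravel())  # ~[1.02, 1.50]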

Regardless, it is important to treat X_test as though it were out of sample, so that it provides a (hopefully) reliable performance estimate for unseen data. Since you don't know the distribution of unseen data, you should pretend you don't know the distribution of X_test either, including its mean and standard deviation.