37
votes

What is the difference between standardscaler and normalizer in sklearn.preprocessing module? Don't both do the same thing? i.e remove mean and scale using deviation?

8 Answers

42
votes

From the Normalizer docs:

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.

And from the StandardScaler docs:

Standardize features by removing the mean and scaling to unit variance

In other words, Normalizer acts row-wise and StandardScaler column-wise. Normalizer does not remove the mean and scale by the deviation; it scales each whole row to unit norm.
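A minimal sketch of the two directions, using a made-up 3×2 matrix:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# StandardScaler works column-wise: each feature ends up with mean 0 and std 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]

# Normalizer works row-wise: each sample ends up with unit l2 norm
X_norm = Normalizer().fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # approximately [1, 1, 1]
```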

14
votes

This visualization and article by Ben help a lot in illustrating the idea.

[visualization from the linked article]

The StandardScaler assumes your data is normally distributed within each feature. By "removing the mean and scaling to unit variance", you can see in the picture that the features now share the same scale regardless of their original ones.

7
votes

In addition to the excellent suggestion by @vincentlcy to view this article, there is now an example in the Scikit-Learn documentation here. An important difference is that Normalizer() is applied to each sample (i.e., row) rather than column. This may work only for certain datasets that fit its assumption of similar types of data in each column.

4
votes

StandardScaler() standardizes features (for example, the height and weight of a person) by removing the mean and scaling to unit variance.

(Unit variance means that, after scaling, the standard deviation, and hence the variance, of each feature is 1.)
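The same standardization can be computed by hand, a sketch with made-up height (cm) and weight (kg) values for three people:

```python
import numpy as np

# hypothetical height and weight columns for three people
data = np.array([[170.0, 60.0],
                 [180.0, 80.0],
                 [160.0, 70.0]])

# standardization: subtract each column's mean, divide by each column's std
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# each column now has mean 0 and standard deviation (and variance) 1
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```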

Normalizer() rescales each sample. For example, it rescales each company's stock prices independently of the others.

Some stocks are more expensive than others. To account for this, we normalize them: the Normalizer separately transforms each company's stock prices to a relative scale.
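A sketch with hypothetical prices: two companies whose stocks move identically but trade at very different price levels end up on the same relative scale after row-wise normalization.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# hypothetical closing prices for two companies over four days (one row each)
prices = np.array([[700.0, 710.0, 690.0, 705.0],   # expensive stock
                   [  7.0,   7.1,   6.9,   7.05]]) # cheap stock, same movements

# each row is divided by its own l2 norm, so the two series become comparable
relative = Normalizer().fit_transform(prices)
print(relative)
```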

3
votes

The main difference is that StandardScaler is applied to columns, while Normalizer is applied to rows, so make sure you reshape your data accordingly before normalizing it.
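A sketch of the reshaping point, using made-up height values: a 1-D array is ambiguous, so `reshape(-1, 1)` turns it into a single column (one feature, many samples) before scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

heights = np.array([160.0, 170.0, 180.0])

# reshape(-1, 1) makes this a single column: 3 samples of 1 feature,
# which is the shape StandardScaler expects
column = heights.reshape(-1, 1)
scaled = StandardScaler().fit_transform(column)
print(scaled)  # the one feature now has mean 0 and std 1
```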

2
votes

StandardScaler standardizes features by removing the mean and scaling to unit variance; Normalizer rescales each sample.

0
votes

Building off of the answer from @TerrenceJ, here is the code to manually calculate the Normalizer-transformed result from the example in the SKLearn documentation (note that this reflects the default "l2" normalization).

# create the original example
X = [[4, 1, 2, 2],
     [1, 3, 9, 3],
     [5, 7, 5, 1]]



# Manual Method:

import numpy as np

# get the square root of the sum of squares for each record ("row")
div = [np.sqrt(np.sum(np.power(X[i], 2))) for i in range(len(X))]

# divide each value by its record's respective square root of the sum of squares
np.array([X[k] / div[k] for k in range(len(X))])

# array([[0.8, 0.2, 0.4, 0.4],
#        [0.1, 0.3, 0.9, 0.3],
#        [0.5, 0.7, 0.5, 0.1]])



# SKLearn API Method:

from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(X)

# array([[0.8, 0.2, 0.4, 0.4],
#        [0.1, 0.3, 0.9, 0.3],
#        [0.5, 0.7, 0.5, 0.1]])

0
votes

Perhaps a helpful example:

With Normalizer, it seems that the default operation is to divide each data point in a row by the $\ell_2$ norm of the row.

For example, given a row [4,1,2,2], the $\ell_2$ norm is: $\sqrt{4^2 + 1^2 + 2^2 + 2^2} = \sqrt{25} = 5$.

The normalized row is then:

[4/5, 1/5, 2/5, 2/5] = [0.8, 0.2, 0.4, 0.4]

This is the first row of the example from the SKLearn docs.
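The same arithmetic in a couple of lines of numpy, as a quick check:

```python
import numpy as np

row = np.array([4.0, 1.0, 2.0, 2.0])

# l2 norm: sqrt(16 + 1 + 4 + 4) = sqrt(25) = 5
normalized = row / np.linalg.norm(row)
print(normalized)  # [0.8, 0.2, 0.4, 0.4]
```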