What is the difference between StandardScaler and Normalizer in the sklearn.preprocessing module? Don't both do the same thing, i.e., remove the mean and scale using the deviation?
8 Answers
From the Normalizer docs:
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.
And from the StandardScaler docs:
Standardize features by removing the mean and scaling to unit variance
In other words, Normalizer acts row-wise and StandardScaler column-wise. Normalizer does not remove the mean and scale by the deviation; it scales the whole row to unit norm.
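A quick sketch of that difference (the small matrix below is just made-up numbers, with rows deliberately proportional to each other):
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# column-wise: each feature ends up with mean 0 and unit variance
StandardScaler().fit_transform(X)
# array([[-1.22474487, -1.22474487],
#        [ 0.        ,  0.        ],
#        [ 1.22474487,  1.22474487]])

# row-wise: each sample is scaled to unit l2 norm; since every row here
# is a multiple of [1, 10], all rows become identical after normalization
Normalizer().fit_transform(X)
# array([[0.09950372, 0.99503719],
#        [0.09950372, 0.99503719],
#        [0.09950372, 0.99503719]])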
This visualization and article by Ben help a lot in illustrating the idea.
The StandardScaler assumes your data is normally distributed within each feature. After "removing the mean and scaling to unit variance", you can see in the picture that the features now share the same "scale" regardless of their original ones.
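For instance, the standardization can be reproduced by hand with a plain z-score (a minimal sketch; the input matrix is arbitrary):
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[4.0, 1.0],
              [1.0, 3.0],
              [5.0, 7.0]])

# z-score by hand: subtract each column's mean, divide by its std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler performs the same column-wise standardization
np.allclose(manual, StandardScaler().fit_transform(X))  # True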
In addition to the excellent suggestion by @vincentlcy to view this article, there is now an example in the Scikit-Learn documentation here. An important difference is that Normalizer() is applied to each sample (i.e., row) rather than each column. This may work only for datasets that fit its assumption of similar types of data in each column.
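To illustrate that caveat with invented numbers: if each row mixes different kinds of quantities (say height in cm and weight in kg), the row norm folds incompatible units together, which is rarely what you want:
import numpy as np
from sklearn.preprocessing import Normalizer

# invented person data: height (cm) and weight (kg) per row
people = np.array([[180.0, 80.0],
                   [160.0, 80.0]])

# each row is divided by its l2 norm, mixing cm and kg in one number
Normalizer().fit_transform(people)
# array([[0.91381155, 0.40613847],
#        [0.89442719, 0.4472136 ]])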
StandardScaler() standardizes features (such as the features of person data, i.e., height and weight) by removing the mean and scaling to unit variance.
(Unit variance means that the standard deviation of a sample, as well as its variance, will tend towards 1 as the sample size tends towards infinity.)
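A quick numerical check of both properties, on invented height/weight values:
import numpy as np
from sklearn.preprocessing import StandardScaler

# invented person data: height (cm) and weight (kg)
X = np.array([[170.0, 60.0],
              [180.0, 80.0],
              [160.0, 55.0]])

Z = StandardScaler().fit_transform(X)
Z.mean(axis=0)  # ~[0, 0]: each feature's mean is removed
Z.std(axis=0)   # [1, 1]: each feature is scaled to unit variance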
Normalizer() rescales each sample. For example, it can rescale each company's stock prices independently of the others.
Some stocks are more expensive than others. To account for this, we normalize them: the Normalizer separately transforms each company's stock prices onto a relative scale.
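A minimal sketch of that scenario, with made-up closing prices (one row per company, one column per day):
import numpy as np
from sklearn.preprocessing import Normalizer

# made-up daily closing prices for a cheap and an expensive stock
prices = np.array([[ 10.0,  11.0,  12.0],
                   [100.0, 110.0, 120.0]])

# each row is rescaled to unit l2 norm independently, so both
# companies land on the same relative scale
Normalizer().fit_transform(prices)
# array([[0.52342392, 0.57576631, 0.62810871],
#        [0.52342392, 0.57576631, 0.62810871]])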
Building off of the answer from @TerrenceJ, here is the code to manually calculate the Normalizer-transformed result from the example in the first SKLearn documentation (and note that this reflects the default "l2" normalization).
import numpy as np
from sklearn.preprocessing import Normalizer

# create the original example
X = [[4, 1, 2, 2],
     [1, 3, 9, 3],
     [5, 7, 5, 1]]

# Manual method:
# get the square root of the sum of squares (the l2 norm) of each record ("row")
div = [np.sqrt(np.sum(np.power(X[i], 2))) for i in range(len(X))]
# divide each value by its record's respective l2 norm
np.array([X[k] / div[k] for k in range(len(X))])
# array([[0.8, 0.2, 0.4, 0.4],
#        [0.1, 0.3, 0.9, 0.3],
#        [0.5, 0.7, 0.5, 0.1]])

# SKLearn API method:
Normalizer().fit_transform(X)
# array([[0.8, 0.2, 0.4, 0.4],
#        [0.1, 0.3, 0.9, 0.3],
#        [0.5, 0.7, 0.5, 0.1]])
Perhaps a helpful example:
With Normalizer, it seems that the default operation is to divide each data point in a row by the norm of the row.
For example, given the row [4, 1, 2, 2], the (l2) norm is sqrt(4^2 + 1^2 + 2^2 + 2^2) = sqrt(25) = 5.
The normalized row is then:
[4/5, 1/5, 2/5, 2/5] = [0.8, 0.2, 0.4, 0.4]
This is the first row of the example from the SKLearn docs.
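As a final note, l2 is only the Normalizer default; the docs quoted above also mention l1. A small sketch of the same row under both norms:
from sklearn.preprocessing import Normalizer

X = [[4, 1, 2, 2]]

# default l2 norm: sqrt(4^2 + 1^2 + 2^2 + 2^2) = 5
Normalizer(norm='l2').fit_transform(X)  # [[0.8, 0.2, 0.4, 0.4]]

# l1 norm: |4| + |1| + |2| + |2| = 9
Normalizer(norm='l1').fit_transform(X)
# [[0.44444444, 0.11111111, 0.22222222, 0.22222222]]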