
I have tried reading a number of references about PCA and found a difference between them. Some references describe this algorithm (my rough NumPy sketch of how I read it follows the list):

  1. Prepare the initial data (m x n)
  2. Calculate the mean
  3. Subtract the mean from the initial data
  4. Calculate the covariance matrix
  5. Calculate the eigenvalues and eigenvectors
  6. Transform the data to get the result (m x k)
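Here is roughly how I understand the first version, as a minimal NumPy sketch (the function name and the use of np.linalg.eigh are my own choices, not taken from any of the references):

    import numpy as np

    def pca_no_scaling(X, k):
        # Steps 2-3: subtract the column means
        X_centered = X - X.mean(axis=0)
        # Step 4: covariance matrix of the centered data (n x n)
        cov = np.cov(X_centered, rowvar=False)
        # Step 5: eigenvalues and eigenvectors (eigh, since cov is symmetric)
        eigvals, eigvecs = np.linalg.eigh(cov)
        # Keep the k eigenvectors with the largest eigenvalues
        order = np.argsort(eigvals)[::-1][:k]
        W = eigvecs[:, order]
        # Step 6: project onto the top-k components -> (m x k)
        return X_centered @ W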

and several other references describe this algorithm (again, my sketch of it follows the list):

  1. Prepare the initial data (m x n)
  2. Calculate the mean
  3. Calculate the standard deviation
  4. Compute the z-score = (initial data - mean) / standard deviation
  5. Calculate the covariance matrix
  6. Calculate the eigenvalues and eigenvectors
  7. Transform the data to get the result (m x k)
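And here is how I read the second version; as far as I can tell, only the normalization step differs (again, just my own sketch):

    import numpy as np

    def pca_zscore(X, k):
        # Steps 2-4: z-score each column (subtract mean, divide by standard deviation)
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        # Step 5: covariance matrix of the standardized data
        cov = np.cov(Z, rowvar=False)
        # Steps 6-7: same as in the first version
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:k]
        return Z @ eigvecs[:, order]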

I'm confused about which one is the correct algorithm. Can anyone explain when to use each of them?

Thanks for your help.


1 Answer


From what I can see, the only difference between the algorithms you list is the normalization by the standard deviation. This is standard practice: it ensures that variables with different ranges are re-scaled to a similar range. If your data is already on a similar scale, this step is not strictly necessary. You can find a more in-depth discussion about it here: https://stats.stackexchange.com/questions/134104/why-do-we-divide-by-the-standard-deviation-and-not-some-other-standardizing-fact

To give an example of such a scaling problem, imagine multidimensional data in which each dimension describes a different quantity. For instance, one dimension could describe the distance to some object in mm and range from 1000 to 3000, while the other dimensions describe the R, G and B components of the object's colour as float values ranging from 0.0 to 1.0. To make sure that each dimension has a similar "influence", we divide it by its standard deviation; a small synthetic illustration follows.
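As a rough illustration (a synthetic example I made up, the numbers are arbitrary), you can see how dividing by the standard deviation equalizes the variances of such columns:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 500
    distance_mm = rng.uniform(1000, 3000, size=m)   # distance in mm, roughly 1000-3000
    rgb = rng.uniform(0.0, 1.0, size=(m, 3))        # R, G, B components in [0.0, 1.0]
    X = np.column_stack([distance_mm, rgb])

    # Variance per column before scaling: the mm column dwarfs the colour columns,
    # so PCA on the raw covariance would be dominated by it.
    print(X.var(axis=0))

    # After dividing each column by its standard deviation, every column has
    # variance 1 and contributes on a comparable scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    print(Z.var(axis=0))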