
I have tried reading a number of references about PCA and found a difference between them. Some references describe this algorithm (my rough NumPy sketch of how I read it follows the list):

  1. Prepare the initial data (m x n)
  2. Calculate the mean
  3. Subtract the mean from the initial data
  4. Calculate the covariance matrix
  5. Calculate the eigenvalues and eigenvectors
  6. Transform the data to get the result (m x k)
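Here is roughly how I understand the first version, as a minimal NumPy sketch (the function name and the use of np.linalg.eigh are my own choices, not taken from any of the references):

    import numpy as np

    def pca_no_scaling(X, k):
        # Steps 2-3: subtract the column means
        X_centered = X - X.mean(axis=0)
        # Step 4: covariance matrix of the centered data (n x n)
        cov = np.cov(X_centered, rowvar=False)
        # Step 5: eigenvalues and eigenvectors (eigh, since cov is symmetric)
        eigvals, eigvecs = np.linalg.eigh(cov)
        # Keep the k eigenvectors with the largest eigenvalues
        order = np.argsort(eigvals)[::-1][:k]
        W = eigvecs[:, order]
        # Step 6: project onto the top-k components -> (m x k)
        return X_centered @ W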

and several other references describe this algorithm (again, my sketch of it follows the list):

  1. Prepare the initial data (m x n)
  2. Calculate the mean
  3. Calculate the standard deviation
  4. Compute the z-score = (initial data - mean) / standard deviation
  5. Calculate the covariance matrix
  6. Calculate the eigenvalues and eigenvectors
  7. Transform the data to get the result (m x k)
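And here is how I read the second version; as far as I can tell, only the normalization step differs (again, just my own sketch):

    import numpy as np

    def pca_zscore(X, k):
        # Steps 2-4: z-score each column (subtract mean, divide by standard deviation)
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        # Step 5: covariance matrix of the standardized data
        cov = np.cov(Z, rowvar=False)
        # Steps 6-7: same as in the first version
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:k]
        return Z @ eigvecs[:, order]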

I'm confused about which one is the correct algorithm. Can anyone explain when to use each of them?

Thanks for your help.


1 Answer


From what I can see, the only difference between the algorithms you list is the normalization by the standard deviation. This is standard practice: it ensures that variables with different ranges are re-scaled to a similar range. If your data is already on a similar scale, this step is not strictly necessary. You can find a more in-depth discussion about it here: https://stats.stackexchange.com/questions/134104/why-do-we-divide-by-the-standard-deviation-and-not-some-other-standardizing-fact

To give an example of such a scaling problem, imagine multidimensional data in which each dimension describes a different quantity. For instance, one dimension could describe the distance to some object in mm and range from 1000 to 3000, while the other dimensions describe the R, G and B components of the object's colour as float values ranging from 0.0 to 1.0. To make sure that each dimension has a similar "influence", we divide it by its standard deviation; a small synthetic illustration follows.
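As a rough illustration (a synthetic example I made up, the numbers are arbitrary), you can see how dividing by the standard deviation equalizes the variances of such columns:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 500
    distance_mm = rng.uniform(1000, 3000, size=m)   # distance in mm, roughly 1000-3000
    rgb = rng.uniform(0.0, 1.0, size=(m, 3))        # R, G, B components in [0.0, 1.0]
    X = np.column_stack([distance_mm, rgb])

    # Variance per column before scaling: the mm column dwarfs the colour columns,
    # so PCA on the raw covariance would be dominated by it.
    print(X.var(axis=0))

    # After dividing each column by its standard deviation, every column has
    # variance 1 and contributes on a comparable scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    print(Z.var(axis=0))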