
So I have a dataset of pictures, where each column is a vector that can be reshaped into a 32x32 picture. The dimensions of my dataset are 1024 x 20000, meaning 20000 sample images.

Now, when I look at various ways of doing PCA without using the built-in functions from something like scikit-learn, people sometimes take the mean along axis 0 and subtract the result from the original matrix to center the data before computing the covariance matrix, i.e. the following:

import numpy as np

A = np.zeros((1024, 20000))  # placeholder for the (1024, 20000) numpy array
mean_axis0 = A.mean(axis=0)  # shape (20000,): one mean per column
new_A = A - mean_axis0       # broadcasts over the rows

Other times people take the mean along axis 1 instead and subtract that from the original matrix:

import numpy as np

A = np.zeros((1024, 20000))      # placeholder for the (1024, 20000) numpy array
mean_axis1 = A.mean(axis=1)      # shape (1024,): one mean per row
new_A = A - mean_axis1[:, None]  # reshape to (1024, 1) so broadcasting works
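To see what each axis choice actually produces, here is a small runnable sketch (with a smaller second dimension than 20000 so it runs quickly; the shapes behave the same way):

```python
import numpy as np

# Toy stand-in for the real data: 1024 pixel features x 200 images.
rng = np.random.default_rng(0)
A = rng.random((1024, 200))

# axis=0: one mean per column, i.e. the mean pixel value of each image.
col_means = A.mean(axis=0)   # shape (200,)

# axis=1: one mean per row, i.e. the mean of each pixel across all images.
row_means = A.mean(axis=1)   # shape (1024,)

print(col_means.shape)  # (200,)
print(row_means.shape)  # (1024,)

# A - col_means broadcasts fine, but A - row_means raises a shape error;
# keepdims=True (or row_means[:, None]) keeps the result as a (1024, 1)
# column that broadcasts across the columns.
centered = A - A.mean(axis=1, keepdims=True)
print(centered.shape)   # (1024, 200)
```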

Now my question is: when are you supposed to do which? For a dataset like my example, which of the two methods would I use?

I've looked at a variety of websites, such as https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/ and http://sebastianraschka.com/Articles/2014_pca_step_by_step.html


1 Answer


I think you're talking about centering the dataset to have zero mean. You should average over the observations, i.e. compute the mean along the axis that runs across the samples, which gives one mean per feature.

In your example, you have 20,000 observations with 1,024 dimensions each, and your matrix lays out each observation as a column. So you should average the columns together (the mean along axis 1), which gives the mean image: one mean per pixel row.

In code that would be: A = A - A.mean(axis=1, keepdims=True)
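A minimal sketch of that centering step, using random stand-in data of my own rather than the asker's actual images (500 samples instead of 20,000, purely to keep it fast):

```python
import numpy as np

# Observations are columns, so average across axis=1 to get the mean image.
rng = np.random.default_rng(1)
A = rng.random((1024, 500))                 # 500 sample images as columns

mean_image = A.mean(axis=1, keepdims=True)  # shape (1024, 1)
A_centered = A - mean_image

# Every pixel (row) now has zero mean across the observations.
print(np.allclose(A_centered.mean(axis=1), 0.0))  # True

# From here the covariance matrix is (1/(n-1)) * A_centered @ A_centered.T,
# whose eigenvectors are the principal components.
C = A_centered @ A_centered.T / (A.shape[1] - 1)
print(C.shape)  # (1024, 1024)
```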