0 votes

I am trying to learn the basics of PCA in Python using scikit-learn (in particular sklearn.decomposition and sklearn.preprocessing). The goal is to load image data into a matrix X (each row is a sample, each column is a feature), standardize X, use PCA to extract principal components (the 2 most important, the 6 most important, ..., the 6 least important), project X onto those components, reverse the transformation, and plot the result to see the difference with respect to the original image(s).
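For the N most important components, the straightforward pipeline looks roughly like this (just a sketch of what I am already doing; img_h and img_w are placeholders for the image shape):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

N = 6
scaler = StandardScaler()
X_std = scaler.fit_transform(X)           # standardize: zero mean, unit variance per feature
pca = PCA(n_components=N)                 # keep only the N most important components
X_proj = pca.fit_transform(X_std)         # project onto those components
X_rec = pca.inverse_transform(X_proj)     # map the projection back to feature space
X_back = scaler.inverse_transform(X_rec)  # undo the standardization before plotting
# X_back[i].reshape(img_h, img_w, 3) can then be plotted next to the original image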

Now let's say that I do not want to consider the 2, 3, 4, ... most important principal components, but instead the N least important ones, say N = 6.

How should the analysis be done in that case? I can't simply standardize, call PCA().fit_transform(), and then revert with inverse_transform() to plot the results.

At the moment I am doing something like this:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)  # standardize original data
pca = PCA()
model = pca.fit(X_std)                     # fit a model with all components
Xprime = model.components_[-6:, :]         # get the last 6 (least important) PCs

And then I stop, because I know I should call transform(), but I do not understand how to do it. I have tried several times without being successful.

Can someone tell me whether the previous steps are correct and point me in the right direction?

Thank you very much


EDIT: I have currently adapted the solution suggested in the first answer to my question:

from copy import deepcopy

model = PCA().fit(X_std)
model2pc = deepcopy(model)        # plain assignment would only alias the same fitted object
model2pc.components_[2:, :] = 0   # keep only the 2 most important components
Xp_2pc = model2pc.transform(X_std)
Xr_2pc = model2pc.inverse_transform(Xp_2pc)

And then I do the same for 6 PCs, 60 PCs, and the last 6 PCs. What I have noticed is that this is very time consuming. I would like to build a model that directly extracts the principal components I need (without zeroing out the others) and then call transform() and inverse_transform() with that model.

Please clarify exactly what you are trying to do... How are you going to use these principal components? – desertnaut

I have a set of 1087 RGB images divided into 4 categories; each image is 227x227, so its dimension is 227x227x3. I open each image as a numpy array, flatten it with ravel(), and add it as a row to a matrix with one row per image and one column per "feature", so in my case the matrix is 1087 x 154587. Let's call this matrix X. What I do is: standardize X, use PCA to extract the 6 least important principal components, and project X onto those components. Then I revert the projection and plot one of the samples to see the difference with respect to the original image. – matteof93

I can do this without any problem for the N most important principal components, because it is just a matter of calling the library functions in the correct order, but for the N least important components it is different. What I am trying to do is run PCA() on all components and then, using the components_ attribute, extract the last rows of the component matrix. The problem is that after extracting the rows for the N least important components I do not know how to proceed. – matteof93
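For reference, a minimal sketch of the loading step described in the comment above (image_paths and the imageio reader are assumptions; any image library would do):

import numpy as np
from imageio import imread  # assumed reader; PIL or OpenCV would work the same way

# image_paths is a placeholder list of the 1087 file paths
X = np.vstack([np.asarray(imread(p)).ravel() for p in image_paths])
# each 227x227x3 image becomes one row of length 154587, so X has shape (1087, 154587)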

1 Answer

1 vote

If you want to ignore all but the last 6 principal components, you can just zero out the ones you don't want to keep.

N = 6
X_std = StandardScaler().fit_transform(X)  # standardize the original data
pca = PCA()
model = pca.fit(X_std)                     # fit a model with all components
model.components_[:-N] = 0                 # zero out all but the last N components

Then, to remove all but the last N components from the data, just do a forward and inverse transform of the data:

Xprime = model.inverse_transform(model.transform(X_std))

Here is an example:

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.random.rand(18).reshape(6, 3)
>>> model = PCA().fit(X)

A round-trip transform should give back the original data:

>>> X
array([[0.16594796, 0.02366958, 0.8403745 ],
       [0.25219425, 0.22879029, 0.07950927],
       [0.69636084, 0.4410933 , 0.97431828],
       [0.50121079, 0.44835563, 0.95236146],
       [0.6793044 , 0.53847562, 0.27882302],
       [0.32886931, 0.0643043 , 0.10597973]])
>>> model.inverse_transform(model.transform(X))
array([[0.16594796, 0.02366958, 0.8403745 ],
       [0.25219425, 0.22879029, 0.07950927],
       [0.69636084, 0.4410933 , 0.97431828],
       [0.50121079, 0.44835563, 0.95236146],
       [0.6793044 , 0.53847562, 0.27882302],
       [0.32886931, 0.0643043 , 0.10597973]])

Now zero out the first principal component:

>>> model.components_
array([[ 0.22969899,  0.21209762,  0.94986998],
       [-0.67830467, -0.66500728,  0.31251894],
       [ 0.69795497, -0.71608653, -0.0088847 ]])
>>> model.components_[:-2] = 0
>>> model.components_
array([[ 0.        ,  0.        ,  0.        ],
       [-0.67830467, -0.66500728,  0.31251894],
       [ 0.69795497, -0.71608653, -0.0088847 ]])

The round-trip transform now gives a different result since we've removed the first principal component (which contains the greatest amount of variance):

>>> model.inverse_transform(model.transform(X))
array([[ 0.12742811, -0.01189858,  0.68108405],
       [ 0.36513945,  0.33308073,  0.54656949],
       [ 0.58029482,  0.33392119,  0.49435263],
       [ 0.39987803,  0.35478779,  0.53332196],
       [ 0.71114004,  0.56787176,  0.41047233],
       [ 0.44000711,  0.16692583,  0.56556581]])
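If you would rather not zero out components at all (as asked in the question edit), one possible alternative is to slice the component matrix and do the projection by hand. A minimal sketch, assuming model is the full PCA fit on X_std from the first snippet (and whiten is left at its default of False):

N = 6
W = model.components_[-N:]            # the N least important components, shape (N, n_features)
X_proj = (X_std - model.mean_) @ W.T  # project the centered data onto those components
X_rec = X_proj @ W + model.mean_      # map back to the original feature space
# this should match the zeroing approach while computing only N dot products per sample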