Python PCA sklearn

Question

I'm trying to apply PCA dimensionality reduction to a dataset that is 684 x 1800 (observations x features). I want to reduce the number of features. When I perform the PCA, it tells me that to explain 100% of the variance I need 684 components, so my data would be 684 x 684. Isn't that strange? I mean, exactly the same number as the observations...

Is there an explanation, or am I applying PCA wrongly?

I know that 684 components are needed to explain the whole variance because I plotted the cumulative sum of .explained_variance_ratio_ and it reaches 1 at 684 components. The code below shows the same thing.

My code is basically:

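from sklearn.decomposition import PCA

# Keep enough components to explain (almost) all of the variance
pca = PCA(0.99999999999)
pca.fit(data_rescaled)
reduced = pca.transform(data_rescaled)
print(reduced.shape)       # (number of observations, number of components kept)
print(pca.n_components_)   # number of components kept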

Of course, I don't want to keep the whole variance; 95% is also acceptable. Is it just a wonderful coincidence?

Thank you so much

MaximeKan · Accepted Answer · 2020-12-13T19:30:41

You are using PCA correctly, and this is expected behavior. The explanation for this is connected with the underlying maths behind PCA, and it certainly is not a coincidence that 100% of the variance would be explained with 684 components, which is the number of observations.

There is a theorem in linear algebra stating that if you have a matrix A of dimensions (n, m), then rank(A) <= min(n, m). In your case, the rank of your data matrix is at most 684, which is the number of observations. Why is this relevant? Because it tells you that you could rewrite your data in such a way that at most 684 of your features are linearly independent, meaning that all remaining features are linear combinations of the others. In this new space, you can therefore keep all the information about your sample with no more than 684 features. This is exactly what PCA does.
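The rank bound can be checked numerically. The sketch below is only an illustration on synthetic data: the matrix X, its 50 x 200 shape, and the random seed are assumptions that stand in for the 684 x 1800 dataset. It also shows that centering the data, which sklearn's PCA does before the decomposition, can only lower the rank further.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data: fewer observations than features.
rng = np.random.default_rng(0)
n_samples, n_features = 50, 200
X = rng.normal(size=(n_samples, n_features))

# The rank can never exceed the smaller of the two dimensions.
rank = np.linalg.matrix_rank(X)
# After subtracting the column means the rows sum to zero,
# so the centered matrix has rank at most n_samples - 1.
centered_rank = np.linalg.matrix_rank(X - X.mean(axis=0))

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("rank of X:", rank)
print("rank of centered X:", centered_rank)
print("components kept:", pca.n_components_)
print("cumulative explained variance:", cumulative[-1])
```

The number of components kept matches the number of observations rather than the number of features, which is the same pattern as the 684 components in the question.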

To sum it up, what you observed is just a mathematical property of the PCA decomposition.
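Since keeping 95% of the variance is acceptable, the same float form of n_components used in the question can be given a lower threshold. This is a minimal sketch on made-up data: the low-rank-plus-noise matrix X and its sizes are assumptions chosen so that a few components dominate, and the real data_rescaled may behave quite differently.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 latent factors mixed into 300 features, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 5))
mixing = rng.normal(size=(5, 300))
X = latent @ mixing + 0.1 * rng.normal(size=(100, 300))

# A float in (0, 1) asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(X)

print("reduced shape:", reduced.shape)
print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

How many components survive depends entirely on how correlated the features are, so the count for the real dataset can only be found by fitting it.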
