
I'm trying to use SciKit-Learn to perform PCA on my dataset. It currently has 2,208 rows and 53,741 columns (features), so I want to use PCA to reduce its dimensionality.

I'm following Hands-On Machine Learning with SciKit-Learn and TensorFlow:

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

As far as I understand, this should reduce the number of columns such that they, in total, explain 95% of the variance in my dataset.
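
One way to double-check that (assuming the fit above has already run; both attributes below are standard on a fitted scikit-learn PCA):

print(pca.n_components_)                     # number of components kept
print(pca.explained_variance_ratio_.sum())   # fraction of variance they explain together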

Now I want to see how many features (columns) are left in X_reduced:

X_reduced.shape
(2208, 1)

So it looks like a single feature accounts for at least 95% of the variance in my dataset...

1) This is very surprising, so I looked at how much the most important dimension contributes variance-wise:

pca = PCA(n_components=1)
X2D = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

[ 0.98544046]

So it's 98.5%!

How do I figure out what this seemingly magical dimension is?

2) Don't I need to include my target Y values when doing PCA?

Thanks!


1 Answer


This "seemingly magical dimension" is actually a linear combination of all your dimensions. PCA works by changing basis from your original column space to the space spanned by the eigenvectors of your data's covariance matrix. You don't need the Y-values because PCA only needs the eigenvalues and eigenvectors of your data's covariance matrix.