I'm trying to use SciKit-Learn to perform PCA on my dataset. I currently have 2,208 rows and 53,741 columns (features). So I want to use PCA to reduce the dimensionality of this dataset.
I'm following Hands-On Machine Learning with SciKit-Learn and TensorFlow:
from sklearn.decomposition import PCA
# keep just enough components to explain 95% of the variance in X
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
As far as I understand, this should keep just enough principal components so that, together, they explain 95% of the variance in my dataset.
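For reference, this is how I would check how the explained variance accumulates across components (a minimal sketch, assuming X is the same 2,208 x 53,741 matrix as above):

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with no component cap (it keeps min(n_samples, n_features) components),
# then look at how quickly the cumulative explained variance grows.
pca_full = PCA()
pca_full.fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative[:10])  # variance explained by the first 10 components together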
Now I want to see how many features (columns) are left in X_reduced:
X_reduced.shape
(2208, 1)
So it looks like a single principal component accounts for at least 95% of the variance in my dataset...
1) This is very surprising, so I looked at how much variance the most important dimension explains on its own:
pca = PCA(n_components=1)
X2D = pca.fit_transform(X)
print(pca.explained_variance_ratio_)
[ 0.98544046]
So it's 98.5%!
How do I figure out what this seemingly magical dimension is?
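My best guess is that I should look at pca.components_, which holds the weight of each original feature in each component, but I'm not sure I'm reading it right. A rough sketch of what I have in mind (feature_names is a hypothetical list of my 53,741 column names):

import numpy as np

# components_ has shape (n_components, n_features); row 0 is the first component.
# feature_names is a hypothetical list holding my 53,741 column names.
loadings = pca.components_[0]
top = np.argsort(np.abs(loadings))[::-1][:10]  # indices of the 10 largest |loadings|
for i in top:
    print(feature_names[i], loadings[i])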
2) Don't I need to include my target Y values when doing PCA?
Thanks!