
I'm very new to PCA. I have 10 X variables for my model. These are the X variable labels:

x = ['Day', 'Month', 'Year', 'Rolling Average', 'Holiday Effect', 'Day of the Week', 'Week of the Year', 'Weekend Effect', 'Last Day of the Month', 'Quarter']

This is the graph I generated from the explained variance, with the x-axis being the principal component index:

[  3.47567089e-01   1.72406623e-01   1.68663799e-01   8.86739892e-02
   4.06427375e-02   2.75054035e-02   2.26578769e-02   5.72892368e-03
   2.49272688e-03   6.37160140e-05]

I need to know whether I have a good selection of features, and how I can tell which feature contributes the most.

from sklearn import decomposition

pca = decomposition.PCA()
pca.fit(X_norm)
scores = pca.explained_variance_  # variance explained by each principal component (printed above)
The point of PCA is that you are developing new features to explain the variance in the data. If you're curious which of your original features contribute to the newly derived components, you can calculate the correlation between them (see the sketch after these comments). Looking at your chart, I would drop principal components 8-10, because they explain very little of the variance in the data. – flyingmeatball
I'm not sure which ones are PCs 8-10 to drop? – Bryce Ramgovind
The last three values on the x-axis. They have very low explained variance and can be dropped. – Harichandan Pulagam
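
A rough sketch of the correlation approach flyingmeatball mentions, assuming X_norm is your scaled feature matrix and x is the label list above. For standardized data, the component loadings in pca.components_ are proportional to the correlations between the original features and the components:

import pandas as pd
from sklearn import decomposition

# Assumes X_norm (n_samples x 10, scaled) and the label list x from the question.
pca = decomposition.PCA()
pca.fit(X_norm)

# Each row of pca.components_ is a principal component; each column is an
# original feature. Large absolute loadings mean that feature contributes
# strongly to that component.
loadings = pd.DataFrame(
    pca.components_,
    columns=x,
    index=[f"PC{i+1}" for i in range(pca.n_components_)],
)

# Feature with the largest absolute loading on the first component:
print(loadings.loc["PC1"].abs().idxmax())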

1 Answer


Though I do not know the dataset, I recommend that you scale your features before using PCA, since variance is maximized along the axes and features with larger ranges would otherwise dominate the components. I assume X_norm refers to the scaled data in your code.
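
For example, a minimal sketch with scikit-learn's StandardScaler (assuming X is your raw feature matrix):

from sklearn.preprocessing import StandardScaler
from sklearn import decomposition

# Standardize each feature to zero mean and unit variance, then fit PCA.
X_norm = StandardScaler().fit_transform(X)  # X: raw (n_samples, n_features) array

pca = decomposition.PCA()
pca.fit(X_norm)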

By using PCA, we aim to reduce dimensionality. To do that, we start with a feature space that includes all X variables (in your case) and end up with a projection of that space, which is typically a different feature (sub)space.

In practice, when your features are correlated, PCA can help you project that shared information onto fewer dimensions.

Think about it this way: if I'm holding a piece of paper full of dots on my desk, do I need the third dimension to represent that dataset? Probably not, since all the dots lie on the paper and can be represented in 2D space.

When you are deciding how many principal components to keep from your new feature space, you can look at the explained variance; it tells you how much information each principal component carries.
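
A short sketch of how you might inspect this, assuming the fitted pca from above (explained_variance_ratio_ gives each component's share of the total variance):

import numpy as np

# Fraction of total variance carried by each principal component,
# and the running total across components.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")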

When I look at the principal components in your data, I see that roughly 85% of the variance can be attributed to the first 6 principal components.

You can also set n_components when constructing the PCA. For example, if you use n_components=2, your transformed dataset will have 2 features.
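
For instance (a sketch, again assuming the scaled X_norm):

from sklearn import decomposition

# Keep only the first two principal components.
pca2 = decomposition.PCA(n_components=2)
X_2d = pca2.fit_transform(X_norm)  # shape: (n_samples, 2)
print(X_2d.shape)

Note that n_components also accepts a float between 0 and 1, in which case scikit-learn keeps just enough components to explain that fraction of the variance (e.g. n_components=0.85).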