
I'm very new to PCA. I have 10 X variables for my model. These are the X variable labels:

x = ['Day', 'Month', 'Year', 'Rolling Average', 'Holiday Effect', 'Day of the Week', 'Week of the Year', 'Weekend Effect', 'Last Day of the Month', 'Quarter']

This is the graph I generated from the explained variance, with the x-axis being the principal component index:

[  3.47567089e-01   1.72406623e-01   1.68663799e-01   8.86739892e-02
   4.06427375e-02   2.75054035e-02   2.26578769e-02   5.72892368e-03
   2.49272688e-03   6.37160140e-05]

I need to know whether I have a good selection of features, and how I can tell which feature contributes the most.

from sklearn import decomposition

pca = decomposition.PCA()
pca.fit(X_norm)
scores = pca.explained_variance_  # variance explained by each principal component (printed above)
The point of PCA is that you are developing new features to explain the variance in the data. If you're curious which of your original features contribute to the newly derived components, you can calculate the correlation between them (see the sketch after these comments). Looking at your chart, I would drop principal components 8-10, because they explain very little of the variance in the data. – flyingmeatball
I'm not sure which ones are PCs 8-10 to drop? – Bryce Ramgovind
The last three values on the x-axis. They have very low explained variance and can be dropped. – Harichandan Pulagam
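
A rough sketch of the correlation approach flyingmeatball mentions, assuming X_norm is your scaled feature matrix and x is the label list above. For standardized data, the component loadings in pca.components_ are proportional to the correlations between the original features and the components:

import pandas as pd
from sklearn import decomposition

# Assumes X_norm (n_samples x 10, scaled) and the label list x from the question.
pca = decomposition.PCA()
pca.fit(X_norm)

# Each row of pca.components_ is a principal component; each column is an
# original feature. Large absolute loadings mean that feature contributes
# strongly to that component.
loadings = pd.DataFrame(
    pca.components_,
    columns=x,
    index=[f"PC{i+1}" for i in range(pca.n_components_)],
)

# Feature with the largest absolute loading on the first component:
print(loadings.loc["PC1"].abs().idxmax())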

1 Answer


Though I do not know the dataset, I recommend that you scale your features before using PCA, since variance is maximized along the axes and features with larger ranges would otherwise dominate the components. I assume X_norm refers to the scaled data in your code.
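
For example, a minimal sketch with scikit-learn's StandardScaler (assuming X is your raw feature matrix):

from sklearn.preprocessing import StandardScaler
from sklearn import decomposition

# Standardize each feature to zero mean and unit variance, then fit PCA.
X_norm = StandardScaler().fit_transform(X)  # X: raw (n_samples, n_features) array

pca = decomposition.PCA()
pca.fit(X_norm)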

By using PCA, we aim to reduce dimensionality. To do that, we start with a feature space that includes all X variables (in your case) and end up with a projection of that space, which is typically a different feature (sub)space.

In practice, when your features are correlated, PCA can help you project that shared information onto fewer dimensions.

Think about it this way: if I'm holding a piece of paper full of dots on my desk, do I need the third dimension to represent that dataset? Probably not, since all the dots lie on the paper and can be represented in 2D space.

When you are deciding how many principal components to keep from your new feature space, you can look at the explained variance; it tells you how much information each principal component carries.
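
A short sketch of how you might inspect this, assuming the fitted pca from above (explained_variance_ratio_ gives each component's share of the total variance):

import numpy as np

# Fraction of total variance carried by each principal component,
# and the running total across components.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")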

When I look at the principal components in your data, I see that roughly 85% of the variance can be attributed to the first 6 principal components.

You can also set n_components when constructing the PCA. For example, if you use n_components=2, your transformed dataset will have 2 features.
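
For instance (a sketch, again assuming the scaled X_norm):

from sklearn import decomposition

# Keep only the first two principal components.
pca2 = decomposition.PCA(n_components=2)
X_2d = pca2.fit_transform(X_norm)  # shape: (n_samples, 2)
print(X_2d.shape)

Note that n_components also accepts a float between 0 and 1, in which case scikit-learn keeps just enough components to explain that fraction of the variance (e.g. n_components=0.85).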