
I want to reduce the features of a dataset using PCA, LDA and MDS. But I want to preserve 95% variance as well.

I couldn't find a way to indicate the desired variance in the parameters of the respective algorithms. One paragraph in PCA's API documentation (sklearn.decomposition.PCA) seems relevant:

If n_components == 'mle', Minka's MLE is used to guess the dimension.

If 0 < n_components < 1, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

But how can n_components be equal to 'mle' and a fraction at the same time?

Setting n_components='mle' reduced the features from 40 to 39, which is not helpful.
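Roughly what I tried (a minimal sketch; loading my 40-feature data into x is omitted):

from sklearn.decomposition import PCA

pca = PCA(n_components='mle')      # Minka's MLE guesses the dimension
x_mle = pca.fit_transform(x)       # x is my 40-feature data
print(x_mle.shape)                 # only drops to 39 components in my case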


1 Answer


The PCA object in sklearn.decomposition has an attribute called explained_variance_ratio_, an array giving the fraction of the total variance explained by each principal component, in decreasing order.

So you can first create a PCA object and fit it to the data:

from sklearn.decomposition import PCA

pca_obj = PCA()
x_trans = pca_obj.fit_transform(x)   # x is the data

Now we can keep adding up the variance ratios until we reach the desired value (in my case, 0.95):

s = pca_obj.explained_variance_ratio_
total = 0.0   # cumulative explained variance
comp = 0      # number of components needed

for ratio in s:
    total += ratio
    comp += 1
    if total >= 0.95:
        break

The number of required components is then the value of comp.
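Alternatively, since PCA accepts a float between 0 and 1 for n_components (the second case in the documentation you quoted), you can let it pick the number of components for you. A minimal sketch, assuming the same data x:

from sklearn.decomposition import PCA

pca_obj = PCA(n_components=0.95)     # keep enough components to explain 95% of the variance
x_trans = pca_obj.fit_transform(x)
print(pca_obj.n_components_)         # number of components actually kept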