7 votes

I am doing PCA and I am interested in which original features were most important. Let me illustrate this with an example:

import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1, -1, -1, -1],
              [1, -2, -1, -1],
              [1, -3, -2, -1],
              [1,  1,  1, -1],
              [1,  2,  1, -1],
              [1,  3,  2, -0.5]])
print(X)

Which outputs:

[[ 1.  -1.  -1.  -1. ]
 [ 1.  -2.  -1.  -1. ]
 [ 1.  -3.  -2.  -1. ]
 [ 1.   1.   1.  -1. ]
 [ 1.   2.   1.  -1. ]
 [ 1.   3.   2.  -0.5]]
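As a quick sanity check (a minimal sketch reusing the X defined above), the per-column variance already hints at which features carry little information:

# Variance of each original feature (column); feature 1 is constant
# and feature 4 barely varies, so both carry little information.
print(X.var(axis=0))  # roughly [0.    4.667  2.    0.035]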

Intuitively, one could already say that feature 1 and feature 4 are not very important due to their low variance. Let's apply PCA to this set:

pca = PCA(n_components=2)
pca.fit_transform(X)
comps = pca.components_

Output:

array([[ 0.        ,  0.8376103 ,  0.54436943,  0.04550712],
       [-0.        ,  0.54564656, -0.8297757 , -0.11722679]])

This output represents the importance of each original feature for each of the two principal components (see this for reference). In other words, for the first principal component, feature 2 is most important, then feature 3. For the second principal component, feature 3 looks most important.
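As a small sketch (using the comps array from above, and assuming the absolute loadings are what matters), that per-component ordering can be read off programmatically:

# Order the original features by absolute loading within each principal component.
for i, comp in enumerate(comps):
    order = np.argsort(np.abs(comp))[::-1] + 1  # 1-based feature indices, largest loading first
    print(f"PC{i + 1} feature order: {order}")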

The question is, which feature is most important overall, which one second most, etc.? Can I use the components_ attribute for this? Or am I wrong and is PCA not the correct method for such an analysis (and should I use a feature selection method instead)?


1 Answer

8 votes

The components_ attribute on its own is not the right place to look for feature importance. The loadings in the two arrays (i.e. the two components PC1 and PC2) tell you how each original feature contributes to each principal component (taken together, they form a rotation matrix). But they don't tell you how much each component contributes to describing the transformed feature space, so you don't yet know how to compare the loadings across the two components.
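As a quick check of that rotation-matrix claim (a sketch using the pca object fitted above), the retained component rows are orthonormal:

# The rows of components_ are orthonormal: their pairwise dot products
# form the identity matrix (up to floating-point error).
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))  # True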

However, the answer that you linked actually tells you what to use instead: the explained_variance_ratio_ attribute. This attribute tells you how much of the variance in your feature space is explained by each principal component:

In [5]: pca.explained_variance_ratio_
Out[5]: array([ 0.98934303,  0.00757996])

This means that the first principal component explains almost 99 percent of the variance. You know from components_ that PC1 has the highest loading for the second feature. It follows that feature 2 is the most important feature in your data space. Feature 3 is the next most important, as it has the second highest loading in PC1.

In PC2, the absolute loadings are nearly swapped between feature 2 and feature 3. But as PC2 explains next to nothing of the overall variance, this can be neglected.
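If you want a single overall ranking, one common heuristic (just a sketch, not the only possible definition of "importance") is to weight the absolute loadings by each component's explained variance ratio and sum over the components:

# Weight each component's absolute loadings by its explained variance ratio,
# then sum over components to get one importance score per original feature.
importance = pca.explained_variance_ratio_ @ np.abs(pca.components_)
ranking = np.argsort(importance)[::-1] + 1  # 1-based feature indices, most important first
print(importance)
print("feature ranking:", ranking)

On this toy data that gives the ordering feature 2, feature 3, feature 4, feature 1, which matches the reasoning above.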