I'm am doing PCA and I am interested in which original features were most important. Let me illustrate this with an example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1,-1, -1,-1], [1,-2, -1,-1], [1,-3, -2,-1], [1,1, 1,-1], [1,2,1,-1], [1,3, 2,-0.5]])
print(X)
Which outputs:
[[ 1. -1. -1. -1. ]
[ 1. -2. -1. -1. ]
[ 1. -3. -2. -1. ]
[ 1. 1. 1. -1. ]
[ 1. 2. 1. -1. ]
[ 1. 3. 2. -0.5]]
Intuitively, one could already say that feature 1 and feature 4 are not very important due to their low variance. Let's apply pca on this set:
pca = PCA(n_components=2)
pca.fit_transform(X)
comps = pca.components_
Output:
array([[ 0. , 0.8376103 , 0.54436943, 0.04550712],
[-0. , 0.54564656, -0.8297757 , -0.11722679]])
This output represents the importance of each original feature for each of the two principal components (see this for reference). In other words, for the first principal component, feature 2 is most important, then feature 3. For the second principal component, feature 3 looks most important.
The question is, which feature is most important, which one second most etc? Can I use the component_
attribute for this? Or am I wrong and is PCA not the correct method for doing such analyses (and should I use a feature selection method instead)?