Can I perform a post hoc analysis on Principal Component Analysis?

Question

I performed a PCA on my data, and I have 4 principal components. However, the results are very difficult to interpret in terms of principal components. Could I therefore do a post hoc analysis: take the variable with the highest loading on PC1 (say X1) and the variable with the highest loading on PC2 (say X2), and run a regression with an outcome variable Y to test their association, i.e. lm(Y ~ X1 + X2)?

Here's an example: I have 4 independent variables: memory test, cognition test, attention test, and processing speed test. I have 1 dependent variable, brain connectivity. Therefore, once I perform a PCA I get something like this:

PC1: 0.7 X1 + 0.2 X3
PC2: 0.8 X2
PC3: 0.8 X3 + 0.4 X4
PC4: 0.1 X4

PC1 and PC2 explain 82% of the variance in the data. However, I'm not sure what to make of this. How can I interpret it in terms of my original variables? I was thinking of regressing the outcome on the variables that dominate the principal components, to analyze which of them may be driving the variation: lm(connectivity ~ memory + cognition)

Does that make sense? How can I go about this?

Can you clarify what you did a little more? Maybe provide a small but reproducible example. On the more "theoretical" side of your question: why would you perform a regression analysis on variables that are uncorrelated? My point is that the main goal of PCA is to transform your variables into orthogonal, uncorrelated components, so you would not find a linear association between them. — eduardokapp
Sure, I added a more detailed example. please see above @eduardokapp — J.Doe

eduardokapp · Accepted Answer · 2020-09-11T20:09:15

What a PCA result tells you is which linear combinations of your variables capture the most variance. As you pointed out, PC1 and PC2 explain most of the variance (or information) in your dataset. Why? Because their eigenvalues are the largest.
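As a rough illustration of how eigenvalues relate to explained variance, here is a minimal sketch in base R. The data are simulated stand-ins for the four test scores (they are not the asker's data), and the variable names and coefficients are assumptions made purely for the example.

```r
# Simulated stand-in for the four test scores (hypothetical data).
set.seed(1)
n <- 200
memory    <- rnorm(n)
cognition <- rnorm(n)
attention <- 0.6 * memory + rnorm(n, sd = 0.8)
speed     <- 0.3 * cognition + rnorm(n, sd = 0.9)
tests <- data.frame(memory, cognition, attention, speed)

# PCA on centered, standardized variables.
pca <- prcomp(tests, center = TRUE, scale. = TRUE)

# The eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)

print(round(prop_var, 3))          # share of variance per component
print(round(cumsum(prop_var), 3))  # cumulative share
print(round(pca$rotation, 2))      # loadings: weight of each variable in each PC
```

The `rotation` matrix is where you would look to see which original variables weigh most heavily on each component.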

You could now drop the variable X4, for example, since it only carries weight in the least important components. As for the idea of doing a "post hoc" regression analysis on PC1 and PC2, I don't think this would lead you anywhere. PC1 and PC2 are, by construction, orthogonal and therefore uncorrelated, so there is no linear relation between them.
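The point about components being uncorrelated can be checked numerically. The following is a small sketch on simulated data (again hypothetical, not the asker's dataset): the correlation matrix of the component scores comes out as the identity up to rounding error, and regressing one component on another gives a slope of essentially zero.

```r
# Simulated stand-in data (hypothetical variable names and effect sizes).
set.seed(1)
n <- 200
tests <- data.frame(memory = rnorm(n), cognition = rnorm(n))
tests$attention <- 0.6 * tests$memory + rnorm(n, sd = 0.8)
tests$speed     <- 0.3 * tests$cognition + rnorm(n, sd = 0.9)

pca <- prcomp(tests, center = TRUE, scale. = TRUE)
scores <- pca$x  # one column of scores per principal component

# Correlations between component scores: off-diagonal entries are ~0.
score_cor <- cor(scores)
print(round(score_cor, 3))

# Regressing PC2 on PC1 therefore yields a slope of ~0.
fit <- lm(scores[, "PC2"] ~ scores[, "PC1"])
slope <- unname(coef(fit)[2])
print(slope)
```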

Does any of this clarify your doubts?

I'm open for further discussions :)
