
I'm building a logistic regression model in MATLAB with the Classification Learner app.

I ran PCA in Matlab:

[coeff, score, latent, tsquared, explained] = pca(CreditNumeric);

Here's the coeff, score, latent and explained output:

[Screenshots: coeff, score, latent and explained output]

I want to use the PCA results to reduce the number of input features I pass to the Classification Learner. How do I use the PCA results to select (say 5-7) features that together explain 95% of the variance of the data?

[Screenshot: Classification Learner]

"1. The model should include up to 7 variables, including any of the given attributes or their derivatives. Explain how you arrived at the selected variables." - dbl001
btw - in R's caret package the centering, normalization and PCA are all in the pre-process stage of a pipeline. That's nice. But I still need to know which of the original attributes were chosen as factors. - dbl001

1 Answer


It is actually quite simple: when you load all your variables into the Classification Learner, you can choose which features to use to train your model(s) (see the last screenshot, where the "Feature selection" button appears next to Import Data).

There you can select as many variables as you like, train several combinations, and compare the results at the end.
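Outside the app, the same "try several feature combinations and compare" idea can be sketched in a few lines. This is only an illustration under assumptions: the data and the column subsets are made up, and `fitclinear` with a logistic learner is used as a stand-in for whatever logistic regression model the app trains.

```matlab
% Hypothetical data: 300 observations, 6 numeric features, binary label.
% Only columns 1 and 2 carry signal; columns 3 to 6 are noise.
rng(0);
X = randn(300, 6);
y = double(X(:, 1) + 0.5 * X(:, 2) + 0.3 * randn(300, 1) > 0);

% Candidate feature subsets to compare (chosen arbitrarily for illustration).
subsets = {1:6, 1:2, 3:6};
cvError = zeros(numel(subsets), 1);

for i = 1:numel(subsets)
    % 5-fold cross-validated logistic regression on the selected columns.
    cvModel = fitclinear(X(:, subsets{i}), y, ...
        'Learner', 'logistic', 'KFold', 5);
    % Average misclassification rate over the held-out folds.
    cvError(i) = kfoldLoss(cvModel);
end

% One row per subset, with its cross-validated error.
columns = cellfun(@mat2str, subsets, 'UniformOutput', false)';
disp(table(columns, cvError, 'VariableNames', {'Columns', 'CVError'}));
```

A subset that drops only noise columns should score about as well as the full set, which is the kind of comparison the app shows across trained models.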

The real question, I think, is whether your 5-7 features (in this case, principal components) actually describe 95% of the variance of the data, right?

To check this, you could follow one of two approaches:

  1. The simplest, but not the best:

-Load all your original variables into the Classification Learner instead of the principal components, and use the PCA button that appears next to the Feature selection one in newer versions of MATLAB.

-There you can set the percentage of explained variance (95) and the number of components (7).

  2. I would suggest, though, running pca in MATLAB first, so that you can see, control and analyze all the results, and then training on the principal components in the learner.

This way, you will know how many components your model needs in order to explain 95% of the variance. It may well not be 5-7, or it may be fewer than that... explore first.
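As a minimal sketch of the second approach: the cumulative sum of `explained` shows how many components are needed to reach 95%, and how much a cap of 7 would retain. The matrix `X` below is a synthetic stand-in for `CreditNumeric`, and the `zscore` standardization step is an assumption (the question's call runs `pca` on the raw data). The last lines look at the largest absolute loading per component, which is only a rough aid for relating components back to the original attributes, not a feature-selection rule.

```matlab
% Hypothetical stand-in for CreditNumeric: 200 observations, 10 features.
rng(0);
X = randn(200, 10) * randn(10, 10);

% Assumption: standardize columns so no feature dominates due to its units.
Xz = zscore(X);

[coeff, score, latent, ~, explained] = pca(Xz);

% explained is in percent; accumulate it and find the first component
% at which the running total reaches 95.
cumExplained = cumsum(explained);
k = find(cumExplained >= 95, 1);

% Variance retained if the model is capped at 7 components.
cap = min(7, numel(cumExplained));
retainedAtCap = cumExplained(cap);

% Scores of the first k components, usable as inputs to the learner.
reducedFeatures = score(:, 1:k);

% For each retained component, the original column with the largest
% absolute loading (a rough interpretation aid only).
[~, topVar] = max(abs(coeff(:, 1:k)), [], 1);

fprintf('%d components reach %.1f%% of the variance\n', k, cumExplained(k));
fprintf('%d components retain %.1f%%\n', cap, retainedAtCap);
disp(topVar);
```

If `k` comes out larger than 7, the 95% target and the 7-variable limit cannot both be met with principal components alone, which is worth knowing before training anything.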

That is my suggestion. Good luck!