0
votes

I have a dependent variable y and six independent variables, and I want to fit a linear regression to them. I am using the sklearn library to do it.

The problem is that some of my independent variables are correlated with each other, with correlation coefficients above 0.5, so I can't have them in my model at the same time.

I searched the internet but didn't find a way to select the best set of independent variables for the linear regression and to output which variables were selected.

2
One possibility is to first try a fit with all variables, and then remove from the regression the variable with the least significance and then re-run to see what happens to the fitting results. This test is easy to perform and might help in your analytical work. - James Phillips
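The elimination loop described in this comment could be sketched roughly as follows. This is only an illustration under assumptions that are not in the thread: "least significance" is measured by the absolute OLS t-statistic computed with NumPy (sklearn's LinearRegression does not report significance), the stopping threshold of 2.0 is an arbitrary choice, and the data, the names x0..x5, and the helper functions are made up for the example.

```python
import numpy as np

def t_statistics(X, y):
    """OLS t-statistic for each column of X (an intercept is added internally)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance matrix
    t = beta / np.sqrt(np.diag(cov))
    return t[1:]                                   # drop the intercept's t-statistic

def backward_eliminate(X, y, names, t_threshold=2.0):
    """Repeatedly drop the least significant variable until all pass the threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        t = np.abs(t_statistics(X[:, keep], y))
        worst = int(np.argmin(t))                  # position of the weakest variable
        if t[worst] >= t_threshold:
            break                                  # everything left looks significant
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic data (hypothetical): only x0 and x3 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
names = [f"x{i}" for i in range(6)]

selected = backward_eliminate(X, y, names)
print(selected)
```

After each removal it is worth re-checking the fit quality, as the comment suggests, rather than relying on the threshold alone.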

2 Answers

2
votes

If you see that some of your independent variables are correlated with each other, you should consider removing one variable from each correlated pair.
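As a starting point, here is a small sketch for listing which pairs exceed a correlation threshold. The 0.5 cutoff comes from the question; the helper name correlated_pairs, the synthetic data, and the names x0..x5 are assumptions made for illustration.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.5):
    """List (name_a, name_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)            # columns are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic data (hypothetical): x1 is built to be strongly correlated with x0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=200)
names = [f"x{i}" for i in range(6)]

pairs = correlated_pairs(X, names)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.2f}")
```

Which member of a pair to drop is a judgment call; one option is to keep whichever gives the better cross-validated score.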

I see you are working with scikit-learn. If you don't want to do feature selection manually, you could use one of the methods in scikit-learn's feature_selection module. There are many ways to remove features automatically, and you should cross-validate to determine which one works best for your problem.
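For example, one of the selectors in that module, SequentialFeatureSelector (available in scikit-learn 0.24 and later), can be used roughly like this. This is a sketch, not a recommendation of this particular selector: the synthetic data, the names x0..x5, and the choice of keeping two features are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): x1 is correlated with x0, and only x0 and x3 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=200)
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
names = [f"x{i}" for i in range(6)]

# Forward selection: add one feature at a time, judged by 5-fold cross-validated score.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,   # assumed target size; tune this for real data
    direction="forward",
    cv=5,
)
selector.fit(X, y)

mask = selector.get_support()                      # boolean mask of kept columns
selected = [name for name, kept in zip(names, mask) if kept]
print(selected)

# Fit the final regression on the selected columns only.
model = LinearRegression().fit(selector.transform(X), y)
```

The get_support() mask is what lets you output the variables that were selected, which is what the question asks for.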

1
votes

You are probably looking for k-fold cross-validation.

The idea is to pick candidate feature subsets (randomly, for instance) and have a way to validate them against each other.

For each candidate, train your model with that feature selection on (k-1) partitions of your data and validate it against the remaining partition. Repeat this so that each partition serves once as the validation set, then take the average of your scores (MAE or RMSE, for instance).

That score is an objective figure for comparing your models, that is, your feature selections.
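A minimal sketch of that comparison with scikit-learn is below. It makes some assumptions beyond the answer: with only six variables it tries every non-empty subset instead of sampling them randomly, it uses k = 5 and RMSE as the score, and the data and names x0..x5 are synthetic.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical): only x0 and x3 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
names = [f"x{i}" for i in range(6)]

# Use the same folds for every subset so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = []
for size in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), size):
        scores = cross_val_score(
            LinearRegression(),
            X[:, list(subset)],
            y,
            cv=cv,
            scoring="neg_root_mean_squared_error",  # sklearn returns negated RMSE
        )
        results.append((-scores.mean(), subset))    # flip sign back to plain RMSE

best_rmse, best_subset = min(results)               # lowest average RMSE wins
print([names[i] for i in best_subset], round(best_rmse, 3))
```

With many more variables the number of subsets grows too quickly for this exhaustive loop, and sampling subsets or using a stepwise selector becomes the practical choice.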