I have a training dataset that has 3233 rows and 62 columns. The dependent variable is Happy (train$Happy), which is binary. The other 61 columns are categorical independent variables.

I've created a logistic regression model as follows:

logModel <- glm(Happy ~ ., data = train, family = binomial)

However, I want to reduce the number of independent variables that go into the model, perhaps down to 20 or so. I would like to start by getting rid of collinear categorical variables.

Can someone shed some light on how to determine which categorical variables are collinear, and what threshold I should use when removing a variable from the model?

Thank you!

This could get you started. – David Arenburg
You could use Latent Class Analysis to reduce the number of variables, akin to how Factor Analysis is used to tackle multicollinearity for multiple regression. – Maxim.K

2 Answers


If your variables were continuous, the obvious solution would be penalized logistic regression (the Lasso); in R it is implemented in the glmnet package.
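
A minimal sketch of the Lasso approach with glmnet, assuming the factors are first dummy-coded into a numeric model matrix (glmnet does not accept factors directly):

    library(glmnet)

    # Expand the factors into a numeric design matrix; drop the intercept column
    x <- model.matrix(Happy ~ ., data = train)[, -1]
    y <- train$Happy

    # Cross-validated logistic regression with a Lasso penalty (alpha = 1)
    cvFit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

    # Coefficients at the cross-validated lambda; variables shrunk to
    # exactly zero have effectively been dropped from the model
    coef(cvFit, s = "lambda.min")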

With categorical variables the problem is much more difficult.

I was in a similar situation, and I used the importance plot from the randomForest package to reduce the number of variables. This will not help you find collinearity; it only ranks the variables by importance.

You have only 61 variables, and you may have knowledge of the field, so you could add to your model some derived variables that make sense to you (like z = x1 - x3, if you think the difference x1 - x3 is important) and then rank them all with a random forest model, as sketched below.
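
A minimal sketch of that ranking step, assuming train$Happy is a factor (so the forest does classification); x1, x3, and z are hypothetical column names:

    library(randomForest)

    # Hypothetical derived variable, if x1 and x3 have meaningful numeric codes
    # train$z <- as.numeric(train$x1) - as.numeric(train$x3)

    # Fit a random forest and store the variable-importance measures
    rfModel <- randomForest(Happy ~ ., data = train, importance = TRUE)

    # Rank the predictors and inspect, say, the top 20
    varImpPlot(rfModel, n.var = 20)
    importance(rfModel)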


You could use Cramer's V, or the related phi or contingency coefficient (see a great paper at http://www.harding.edu/sbreezeel/460%20files/statbook/chapter15.pdf), to measure collinearity among categorical variables. If two categorical variables have a Cramer's V value close to 1, they are highly "correlated" and you may not need to keep both of them in your logistic regression model.
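
A minimal sketch of such a screen, computing Cramer's V from a base-R chi-squared test; the 0.8 cutoff is only an illustrative threshold, not a standard one:

    # Cramer's V for two categorical vectors: sqrt(chi2 / (n * (k - 1))),
    # where k is the smaller of the two category counts
    cramersV <- function(x, y) {
      tbl  <- table(x, y)
      chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE))$statistic
      as.numeric(sqrt(chi2 / (sum(tbl) * (min(dim(tbl)) - 1))))
    }

    # Report highly associated pairs among the predictors
    preds <- setdiff(names(train), "Happy")
    for (i in seq_along(preds)) {
      for (j in seq_along(preds)) {
        if (i < j) {
          v <- cramersV(train[[preds[i]]], train[[preds[j]]])
          if (v > 0.8) cat(preds[i], preds[j], round(v, 2), "\n")
        }
      }
    }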