1
votes

I want to fit either a multiple linear regression or a logistic regression in Python on data that has a large number of categorical variables. I understand that with one categorical variable I need to convert it into dummy variables and then drop one of the dummies to avoid collinearity. What should the approach be when there is more than one categorical variable?

Do I do the same thing for each? That is, convert each categorical variable into dummy variables and then drop one dummy per variable to avoid collinearity?


2 Answers

1
votes

If there are many categorical variables, and those variables also have many levels, dummy variables might not be a good option, since every level adds another column.

If a categorical variable holds binned data, e.g. a variable age with values such as 10-18, 18-30, 31-50, ..., you can use Label Encoding (mapping the bins to integers in their natural order), create a new numerical feature from the mean/median of each bin, or create two features for the lower and upper age.
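As a rough illustration of the last two options, here is a minimal pandas sketch. The column name `age_bin` and the sample values are made up for the example, and it assumes every label has the form "lower-upper".

```python
import pandas as pd

# Hypothetical data: ages stored as "lower-upper" bin labels.
df = pd.DataFrame({"age_bin": ["10-18", "18-30", "31-50", "18-30"]})

# Split each label into numeric lower and upper bounds (two features).
bounds = df["age_bin"].str.split("-", expand=True).astype(int)
df["age_lower"] = bounds[0]
df["age_upper"] = bounds[1]

# Alternatively, use the bin midpoint as a single numeric feature.
df["age_mid"] = (df["age_lower"] + df["age_upper"]) / 2

print(df)
```

Open-ended bins such as "50+" would need their own handling before the split.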

If you have timestamps for the start and end of a task, e.g. the time a machine was started and the time it was stopped, you can create a new feature by calculating the duration in hours or minutes.
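A small sketch of the duration feature, assuming the timestamps are strings that `pd.to_datetime` can parse. The column names `start_time` and `stop_time` and the values are invented for the example.

```python
import pandas as pd

# Hypothetical machine log with start and stop timestamps.
df = pd.DataFrame({
    "start_time": ["2021-01-01 08:00", "2021-01-01 22:30"],
    "stop_time": ["2021-01-01 12:30", "2021-01-02 01:00"],
})

start = pd.to_datetime(df["start_time"])
stop = pd.to_datetime(df["stop_time"])

# Subtracting two datetime columns gives a timedelta; convert it to hours.
df["duration_hours"] = (stop - start).dt.total_seconds() / 3600

print(df)
```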

Given many categorical variables with only a few levels each, the straightforward approach is to apply One-Hot Encoding to the categorical variables.

But when a categorical variable has many levels, some levels may be very rare and others very frequent. One-Hot Encoding such data can hurt model performance. In such cases, it is better to apply business logic/feature engineering to reduce the number of levels first. You can then use One-Hot Encoding on the new feature if it is still categorical.
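One possible way to reduce levels, shown only as a sketch: lump every level seen fewer than some threshold number of times into a single "Other" level, then one-hot encode. The `city` column, the data, and the `min_count` threshold are assumptions for the example; real business logic may call for a different grouping.

```python
import pandas as pd

# Hypothetical column with two common levels and two rare ones.
df = pd.DataFrame({"city": ["A"] * 5 + ["B"] * 4 + ["C", "D"]})

min_count = 2  # assumed threshold; tune it for your data

# Find levels that occur fewer than min_count times.
counts = df["city"].value_counts()
rare_levels = counts[counts < min_count].index

# Keep frequent levels as they are and replace rare ones with "Other".
df["city_grouped"] = df["city"].where(~df["city"].isin(rare_levels), "Other")

# One-hot encode the reduced feature, dropping one level as the baseline.
dummies = pd.get_dummies(df["city_grouped"], prefix="city", drop_first=True)

print(dummies.columns.tolist())
```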

0
votes

When more than one categorical variable needs to be replaced by dummies, encode each variable as dummies (as you would for a single categorical variable) and then drop one dummy from each variable's set to avoid collinearity.

Basically, each categorical variable should be treated the same way as a single one.
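For example, with pandas this can be done in one call. This is a minimal sketch; the columns `color`, `size`, and `price` are made up, and `dtype=float` is only there so the result can go straight into a regression.

```python
import pandas as pd

# Hypothetical data with two categorical columns and one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "M"],
    "price": [10.0, 12.5, 9.0, 11.0],
})

# drop_first=True drops one dummy per categorical variable, so each
# variable keeps (number of levels - 1) columns.
X = pd.get_dummies(df, columns=["color", "size"], drop_first=True, dtype=float)

print(X.columns.tolist())
```

Here the first level in sorted order ("blue" and "L") becomes the baseline for each variable, and the numeric column is left untouched.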