
I have several categorical variables with a high number of classes. I used one-hot encoding to convert them into 1/0 format.

original:

column_1    column_2
0.8         X
0.3         C
0.9         D
1.2         C

one-hot encoded:

column_1    column_2_X    column_2_C    column_2_D
0.8         1             0             0
0.3         0             1             0
0.9         0             0             1
1.2         0             1             0
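For reference, something like the following pandas snippet reproduces this encoding step; the frame contents mirror the example above, and pd.get_dummies is just one common way to do it:

import pandas as pd

# Toy frame matching the example above.
df = pd.DataFrame({
    "column_1": [0.8, 0.3, 0.9, 1.2],
    "column_2": ["X", "C", "D", "C"],
})

# One-hot encode column_2: each class becomes its own indicator column
# (column_2_C, column_2_D, column_2_X).
encoded = pd.get_dummies(df, columns=["column_2"])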

Then I checked the feature_importances of the resulting dummy columns.

For example, column_2_C has no importance to the model, but the other dummies that come from the same original column (column_2) have significant importance.
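Roughly, the importance check looks like this (a minimal sketch assuming a scikit-learn tree ensemble; the RandomForestRegressor and the placeholder target y are illustrative only, not my actual setup):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One-hot encoded frame from above plus a made-up target, for illustration.
X = pd.DataFrame({
    "column_1":   [0.8, 0.3, 0.9, 1.2],
    "column_2_X": [1, 0, 0, 0],
    "column_2_C": [0, 1, 0, 1],
    "column_2_D": [0, 0, 1, 0],
})
y = [10.0, 5.0, 12.0, 7.0]

model = RandomForestRegressor(random_state=0).fit(X, y)

# One importance value per column, including each dummy separately.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))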

In this case, or in any similar case (say, 50% of the classes have high importance and 50% have very low importance), what should I do? What if column_2_C is crucially significant but the others (X and D) have no importance at all?

What happens if I remove that class? Is there any best practice for this kind of case?

Thanks in advance,


1 Answer


If you are using the dummy variables in a model, then removing the non-significant variables or non-confounders is appropriate. However, if you are retaining the original categorical variable, you should not delete those observations from your sample. I would need more information about what you are doing.
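As a minimal sketch of what that can look like in practice (the column names come from your question; the model, target, and refit step are assumptions for illustration only), note that dropping the uninformative dummy keeps every row in the sample, since observations of class C simply map to the all-zeros pattern of the remaining column_2_* dummies:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = pd.DataFrame({
    "column_1":   [0.8, 0.3, 0.9, 1.2],
    "column_2_X": [1, 0, 0, 0],
    "column_2_C": [0, 1, 0, 1],
    "column_2_D": [0, 0, 1, 0],
})
y = [10.0, 5.0, 12.0, 7.0]  # placeholder target

# Drop the dummy column, not the rows: all four observations are kept,
# and class C now shows up as zeros in the remaining dummies.
X_reduced = X.drop(columns=["column_2_C"])
model = RandomForestRegressor(random_state=0).fit(X_reduced, y)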