
I have been trying to build an ML model which predicts the time it takes for different products to go through a deployment pipeline. I have created around 30-40 features, 90% of which are categorical and 10% numerical. For instance, I have one feature "product category" which can take 5 different values. I then create dummies for all my categorical variables and end up with around 200-300 variables instead. (A rough sketch of the encoding step is below; the column names are just placeholders, not my real features.)
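import pandas as pd

# Toy frame with one categorical and one numerical feature
# (column names are placeholders, not the real pipeline features)
df = pd.DataFrame({
    "product_category": ["A", "B", "C", "D", "E", "A"],
    "num_dependencies": [3, 7, 2, 5, 1, 4],
    "duration_hours": [12.0, 30.5, 8.2, 22.1, 5.9, 15.3],
})

# One-hot encode the categorical column; each level becomes its own dummy column
X = pd.get_dummies(df.drop(columns=["duration_hours"]), columns=["product_category"])
y = df["duration_hours"]

print(X.columns.tolist())
# ['num_dependencies', 'product_category_A', ..., 'product_category_E']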

I have trained an XGBoost model and checked the feature importance, and noticed that most of my features have an importance below 0.001, and around 30 of them have an importance of exactly 0. What do I do with this information? Should I drop these variables (e.g. drop half of the product categories), or group all of them together inside an "Other" category? Any tips or standard ways of dealing with this? To make the question concrete, the two options I'm considering look roughly like the sketch below (not my exact code).
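import pandas as pd

# Assuming `xgb` is the fitted model and `X` is the dummy-encoded training frame
importances = pd.Series(xgb.feature_importances_, index=X.columns).sort_values()

# Option 1: drop dummy columns whose importance is (near) zero
low_importance = importances[importances < 0.001].index
X_reduced = X.drop(columns=low_importance)

# Option 2: group rare levels into an "Other" category *before* encoding,
# e.g. keep only the most frequent levels of the raw column (name is a placeholder)
top_levels = df["product_category"].value_counts().nlargest(3).index
df["product_category"] = df["product_category"].where(
    df["product_category"].isin(top_levels), other="Other"
)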

EDIT: My hyperparameters:

import xgboost

xgb = xgboost.XGBRegressor(
    max_depth=11,
    n_estimators=150,
    min_child_weight=1,
    eta=0.3,
    subsample=0.9,
    gamma=0.1,
    colsample_bytree=0.9,
    objective='reg:gamma'
)

1 Answer


I'm guessing that your data is very sparse. It would help to know the hyperparameters you used for your model, max_depth for example. Trees are usually quite robust to the number of features, but in GBM we are using weak learners, so if the number of trees you have built is smaller than the number of features, the model may fail to learn the importance of all of them. A quick way to check whether that is what's happening is to refit with more (and shallower) trees and see whether more features pick up a non-zero importance; a rough sketch, assuming X and y are your encoded training data:
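import numpy as np
import xgboost

# Refit with progressively more, shallower trees and count how many features
# end up with non-zero importance (X, y assumed to be the dummy-encoded data)
for n_trees in (150, 500, 1000):
    model = xgboost.XGBRegressor(
        n_estimators=n_trees,
        max_depth=6,          # shallower than max_depth=11 in the question
        eta=0.05,             # lower learning rate to go with more trees
        subsample=0.9,
        colsample_bytree=0.9,
        objective="reg:gamma",
    ).fit(X, y)
    nonzero = np.sum(model.feature_importances_ > 0)
    print(f"{n_trees} trees -> {nonzero} features with non-zero importance")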