I have trained a logistic regression model with 5 levels of a categorical variable and all the levels are significant for the model.

However, in the unseen data the categorical variable has only 3 levels. As a result, the trained model fails to predict on the unseen data because it cannot find some of the levels.

I have used one-hot encoding to convert the categorical variable. How can this issue be resolved?

Code used to convert to dummy variables in the train set:

   metadata_employeegroup = pd.get_dummies(df['metadata_employeegroup'], prefix='metadata_employeegroup', drop_first=True)
   df = pd.concat([df, metadata_employeegroup], axis=1)

Based on RFE, only some factor levels are significant for the model, so while training the model I subset the train set to those columns:

   logsk.fit(X_train[col], y_train)
   y_pred = logsk.predict_proba(X_test[col])

Here col contains only 3 levels of metadata_employeegroup, say L1, L2 and L3.

On the unseen data, I follow the same approach to create the dummy variables. However, the only levels of metadata_employeegroup present there are L1 and L2, so no L3 column is created. The trained model cannot find the L3 column and throws an error.

You should post some data and code; otherwise it is unclear what you have done and what the error is. Missing levels in the test data are fine as long as your coding is consistent between training and test, so the first thing I would check is that consistency. For example, if you use 5 dummies to code the training set, you should also use 5 dummies for the test set (even if 2 of them are always zero). - 9mat
I have added some code and explained the issue I am facing in detail. Can you please look into it? - Biplab Ghosal

1 Answer

For the levels of the categorical variable that are missing from the unseen data, add the corresponding dummy columns to the unseen data and set them to 0 for every record, so that the unseen data has the same columns as the training set.
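
A minimal sketch of this, using toy data in place of the real frames. The level names (L0 to L4) and the variable name train_columns are assumptions for illustration, not taken from the question. One way to add the missing columns is to reindex the unseen dummies against the column list saved from training:

```python
import pandas as pd

# Toy stand-ins for the real data (assumed values, for illustration only).
train = pd.DataFrame({"metadata_employeegroup": ["L0", "L1", "L2", "L3", "L4"]})
unseen = pd.DataFrame({"metadata_employeegroup": ["L1", "L2", "L1"]})

prefix = "metadata_employeegroup"

# Encode the training set and remember exactly which dummy columns it produced.
train_dummies = pd.get_dummies(train[prefix], prefix=prefix, drop_first=True)
train_columns = list(train_dummies.columns)

# Encode the unseen data WITHOUT drop_first: with fewer levels present,
# drop_first would drop a different level than it did on the training set.
unseen_dummies = pd.get_dummies(unseen[prefix], prefix=prefix)

# Align to the training columns: missing levels are added and filled with 0,
# and any column that was not produced in training is discarded.
unseen_dummies = unseen_dummies.reindex(columns=train_columns, fill_value=0).astype(int)
```

After this step every dummy column the model was trained on exists in the unseen data, so selecting the trained columns from it should no longer fail on a missing level.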

I was able to solve it using this One Hot Encoding Tutorial.
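
An alternative sketch (not necessarily what the tutorial does) is to declare the full list of levels up front with a pandas categorical dtype, so that get_dummies produces the same columns on any data set. The level names below are assumed for illustration:

```python
import pandas as pd

# Full set of levels seen at training time (assumed values, for illustration).
levels = ["L0", "L1", "L2", "L3", "L4"]
level_dtype = pd.CategoricalDtype(categories=levels)

unseen = pd.DataFrame({"metadata_employeegroup": ["L1", "L2", "L1"]})

# Casting to the shared dtype tells pandas about levels absent from this data.
unseen_cat = unseen["metadata_employeegroup"].astype(level_dtype)

# get_dummies emits one column per declared category, so drop_first drops
# the same reference level (L0) here as it would on the training set.
unseen_dummies = pd.get_dummies(
    unseen_cat, prefix="metadata_employeegroup", drop_first=True
)
```

Applying the same dtype to the training column before encoding keeps both encodings consistent. A value that is not in the declared levels becomes missing and gets 0 in every dummy column.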