0 votes

How does OneHotEncoder work for a single-value prediction?

Error message: ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16

I am training a Random Forest classifier on text data, computing 16 features per instance. Since all 16 variables are categorical, I use a OneHotEncoder on each of them, which expands the training matrix to 1261 columns. I have also applied feature scaling, done an 80:20 train/test split of my training data, and run the predictor to get the confusion matrix and classification report. I am persisting the classifier, the StandardScaler object, and the OneHotEncoder objects as pickle files on my local disk.
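For context, the training-time persistence described above could look roughly like the following sketch: one LabelEncoder per categorical column, a single OneHotEncoder and StandardScaler fitted on the full training matrix, and every fitted object pickled for the prediction service to reuse. The tiny two-column dataset and all names are illustrative, not from the original code.

```python
# Hedged sketch of the training-time persistence described above (toy data,
# illustrative names): one LabelEncoder per categorical column, a single
# OneHotEncoder over the integer-encoded matrix, a StandardScaler, and
# every *fitted* object pickled for the prediction service to reuse.
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

X_raw = np.array([['red', 'small'],
                  ['blue', 'large'],
                  ['red', 'large'],
                  ['green', 'small']])
y = np.array([0, 1, 1, 0])

label_encoders = []                      # one encoder per column, all persisted
X_int = np.empty(X_raw.shape, dtype=int)
for col in range(X_raw.shape[1]):
    le = LabelEncoder().fit(X_raw[:, col])
    X_int[:, col] = le.transform(X_raw[:, col])
    label_encoders.append(le)

onehot = OneHotEncoder(handle_unknown='ignore').fit(X_int)
X_enc = onehot.transform(X_int).toarray()  # 3 + 2 = 5 columns, not 2

scaler = StandardScaler().fit(X_enc)
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(scaler.transform(X_enc), y)

# In the real service these would be written to .pkl files on disk
model_blob = pickle.dumps(clf)
encoders_blob = pickle.dumps(label_encoders)
onehot_blob = pickle.dumps(onehot)
scaler_blob = pickle.dumps(scaler)

print(X_enc.shape)  # (4, 5)
```

With 16 categorical columns, the same loop produces 16 fitted LabelEncoders, and the one-hot width grows to the sum of all category counts (1261 in the question).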

Now I want to create a REST service for the predictor in a new, separate file. This API would load the saved .pkl model and predict the class of a new single text value, returning its predicted class name and a corresponding confidence score.

The problem I am facing: when I encode this single text value, I get a vector with only 16 features; it is not expanded to 1261 features. Therefore, when I call predict() on the classifier with the new text, I get the following error:

ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16

How can I use a deserialized .pkl model to predict single instances when the encoded matrix doesn't match the size the classifier was trained on? How can I resolve this issue?

Edit: Posting the code snippet and exception stack as well:

# Loading the .pkl files used in training
with open('model.pkl', 'rb') as f_model:
    classifier = pickle.load(f_model) # trained classifier model

with open('labelencoder_file.pkl', 'rb') as f_lblenc:
    label_encoder = pickle.load(f_lblenc) # label encoder object used in training

with open('encoder_file.pkl', 'rb') as f_onehotenc:
    onehotencoder = pickle.load(f_onehotenc) # onehotencoder object used in training

with open('sc_file.pkl', 'rb') as f_sc:
    scaler = pickle.load(f_sc) # standard scaler object used in training

# df_features is the dataframe containing the computed feature values;
# it has 16 columns, as 16 features are computed for the new value
X = df_features
X.values[:, 0] = label_encoder.fit_transform(X.values[:, 0])
X.values[:, 1] = label_encoder.fit_transform(X.values[:, 1])
# This is repeated till X.values[:, 15], as all features are categorical

X = onehotencoder.fit_transform(X).toarray()
X = scaler.fit_transform(X)
print(X.shape) # Prints (1, 16), showing that the encoding has not worked properly

y_pred = classifier.predict(X) # This throws the exception

Traceback (most recent call last):
  File "/home/Test/api.py", line 256, in api_func
    y_pred = classifier.predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 538, in predict
    proba = self.predict_proba(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 578, in predict_proba
    X = self._validate_X_predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 357, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 384, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16

Are you actually applying your OneHotEncoder to your single text value before running predict()? It looks like it is doing nothing - maybe show some code? How are you handling unknown categories, in case your single text value has a category that your OneHotEncoder hasn't seen before? - Stev
While training, I had created a dummy class that would be assigned to a feature, in case the value is unseen - Poulami Debnath
.fit_transform(X) fits and then applies the transform, while .transform(X) just applies it. It appears that you are loading encoders/scalers and then overwriting them by re-fitting. Is this what you are intending to do? - Stev
@Stev correctly mentioned that since you have saved the LabelEncoder and OneHotEncoder at training time, you don't need to call fit_transform() on them. Only call transform(). - Vivek Kumar
Thanks @Stev! This answers my question. Thanks @Vivek for confirming. The problem that I faced was two-fold. First, I was pickling the label encoder object only once. If there are (say) two categorical variables, then after applying Label Encoding to both of these, both encoders should be persisted. The second problem was of course applying fit_transform instead of transform while preparing the test data for prediction. - Poulami Debnath
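The width mismatch these comments identify can be reproduced in isolation: a fitted OneHotEncoder remembers the category vocabulary seen at fit time, while a freshly fitted one only sees the single row. A minimal sketch with a toy one-column dataset (all names are illustrative):

```python
# Reproducing the width mismatch: transform() on a fitted OneHotEncoder
# yields the full training width for one new row, while fit_transform()
# refits on the single row and collapses the encoding.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([[0], [1], [2], [3]])   # one column with 4 categories
enc = OneHotEncoder().fit(train)

new_row = np.array([[2]])
print(enc.transform(new_row).toarray().shape)                  # (1, 4)
print(OneHotEncoder().fit_transform(new_row).toarray().shape)  # (1, 1)
```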

1 Answer

1 vote

Posting the modified code that solved the problem:

# Loading the .pkl files that were persisted during training
with open('model.pkl', 'rb') as f_model:
    classifier = pickle.load(f_model) # trained classifier model

with open('labelencoder00.pkl', 'rb') as f_lblenc00:
    label_encoder00 = pickle.load(f_lblenc00) # LabelEncoder() object that was used for encoding the first categorical variable
with open('labelencoder01.pkl', 'rb') as f_lblenc01:
    label_encoder01 = pickle.load(f_lblenc01) # LabelEncoder() object that was used for encoding the second categorical variable

with open('onehotencoder.pkl', 'rb') as f_onehotenc:
    onehotencoder = pickle.load(f_onehotenc) # OneHotEncoder object that was used in training

with open('sc_file.pkl', 'rb') as f_sc:
    scaler = pickle.load(f_sc) # StandardScaler object that was used in training


X = df_features # df_features is the dataframe containing the computed feature values
X.values[:, 0] = label_encoder00.transform(X.values[:, 0])
X.values[:, 1] = label_encoder01.transform(X.values[:, 1])
# ... repeated up to X.values[:, 15] with the corresponding persisted LabelEncoder

X = onehotencoder.transform(X).toarray()
X = scaler.transform(X) # transform(), not fit_transform(): reuse the fitted scaler

pred = classifier.predict(X)
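Since the service should also return a class name and a confidence score, one common approach (not from the original post) is predict_proba() on the deserialized classifier. A toy model stands in for it here; all names are illustrative, and with string labels classes_ already holds the names:

```python
# Sketch of returning a class name plus confidence score, as the question
# asks; a toy classifier stands in for the deserialized model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(np.array([[0, 1], [1, 0], [1, 1], [0, 0]]),
        np.array(['spam', 'ham', 'ham', 'spam']))

probas = clf.predict_proba(np.array([[0, 1]]))[0]  # per-class probabilities
best = probas.argmax()
class_name = clf.classes_[best]                    # predicted class name
confidence = probas[best]                          # its probability
print(class_name, confidence)
```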