How does OneHotEncoder work for a single-value prediction?
Error message: ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16
I am training a Random Forest classifier on text data. I compute 16 features per instance of this text data. As all 16 variables are categorical, I use OneHotEncoder to encode each of them, which results in a training matrix with 1261 columns. I have also applied feature scaling. I did an 80:20 train:test split of my data and applied the predictor to get the confusion matrix and classification report. I am also persisting the classifier, the standard scaler, and the OneHotEncoder objects in pickle format on my local disk.
Now I want to create a REST service for the predictor in a new, separate file. This API would load the saved model in .pkl format and predict on a new single text value, returning its predicted class name and the corresponding confidence score.
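For illustration, one common way to get a class name together with a confidence score from a scikit-learn classifier is `predict_proba` combined with `classes_`. The sketch below uses toy data; the helper `predict_with_confidence`, the feature values, and the labels are hypothetical and not taken from the actual model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data (hypothetical): 2 numeric features, 2 string class labels.
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_train = np.array(["spam", "spam", "ham", "ham"] * 5)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)


def predict_with_confidence(model, row):
    """Return (class_name, probability) for a single 1-D feature row."""
    # predict_proba expects a 2-D array, so reshape the single row to (1, n_features).
    proba = model.predict_proba(np.asarray(row).reshape(1, -1))[0]
    best = int(np.argmax(proba))
    # classes_ holds the labels in the same order as the probability columns.
    return model.classes_[best], float(proba[best])


label, confidence = predict_with_confidence(clf, [1, 0])
print(label, confidence)
```

For a random forest, the reported probability is the average of the per-tree class probabilities, so it is a rough confidence rather than a calibrated one.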
The problem that I am facing is: When I encode this single text value, I get a vector with 16 features. It doesn't get encoded to 1261 features. Therefore, when I run the predict() function on this classifier on the new text, it gives me the following error:
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16
How can I use a deserialized .pkl model to predict for single instances when the encoded matrix doesn't match the number of features the classifier was trained on? How can I resolve this issue?
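A likely cause of this kind of mismatch can be reproduced on a small scale. The sketch below uses hypothetical data with 2 categorical features instead of 16, and passes strings directly to `OneHotEncoder` (supported in scikit-learn 0.20 and later). It compares `transform()` on an already fitted encoder with `fit_transform()` on a single row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: 2 categorical features with 3 and 2 categories.
X_train = np.array(
    [["red", "small"], ["green", "large"], ["blue", "small"]], dtype=object
)
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train)

x_new = np.array([["green", "small"]], dtype=object)

# transform() reuses the categories learned from X_train: 3 + 2 = 5 columns.
reused = encoder.transform(x_new).toarray()

# fit_transform() on the single row relearns the categories from that row
# alone, so every feature collapses to one column: 1 + 1 = 2 columns.
refit = OneHotEncoder().fit_transform(x_new).toarray()

print(reused.shape, refit.shape)
```

With 16 features and a single row, re-fitting would give exactly 16 columns, which matches the shape reported in the error.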
Edit: Posting the code snippet and exception stack as well:
# Loading the .pkl files used in training
with open('model.pkl', 'rb') as f_model:
    classifier = pickle.load(f_model) # trained classifier model
with open('labelencoder_file.pkl', 'rb') as f_lblenc:
    label_encoder = pickle.load(f_lblenc) # label encoder object used in training
with open('encoder_file.pkl', 'rb') as f_onehotenc:
    onehotencoder = pickle.load(f_onehotenc) # onehotencoder object used in training
with open('sc_file.pkl', 'rb') as f_sc:
    scaler = pickle.load(f_sc) # standard scaler object used in training
X = df_features # df_features is the dataframe containing the computed feature values. It has 16 columns as 16 features have been computed for the new value
X.values[:, 0] = label_encoder.fit_transform(X.values[:, 0])
X.values[:, 1] = label_encoder.fit_transform(X.values[:, 1])
# This is repeated till X.values[:, 15] as all features are categorical
X = onehotencoder.fit_transform(X).toarray()
X = scaler.fit_transform(X)
print(X.shape) # This prints (1, 16), thus showing that encoding has not worked properly
y_pred = classifier.predict(X) # This throws the exception
Traceback (most recent call last):
  File "/home/Test/api.py", line 256, in api_func()
    y_pred = classifier.predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 538, in predict
    proba = self.predict_proba(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 578, in predict_proba
    X = self._validate_X_predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 357, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 384, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16
[...] OneHotEncoder to your single text value before running predict()? It looks like it is doing nothing - maybe show some code? How are you handling unknown categories, in case your single text value has a category that your OneHotEncoder hasn't seen before? - Stev
.fit_transform(X) fits and applies the transform and .transform(X) just applies the transform. It appears that you are loading encoders/scalers and then overwriting them by re-fitting. Is this what you are intending to do? - Stev
[...] fit_transform() on them. Only call transform() - Vivek Kumar
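Following the comments' advice to call only `transform()` at prediction time, one way to make re-fitting hard to do by accident is to bundle the encoder, scaler, and classifier into a single scikit-learn `Pipeline` and pickle that one object. This is a swapped-in design, not the code from the question: the data, step names, `handle_unknown="ignore"`, and `with_mean=False` are all assumptions made for this sketch, and `pickle.dumps` stands in for writing a .pkl file.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical categorical training data (2 features instead of 16).
X_train = np.array(
    [["red", "small"], ["green", "large"], ["blue", "small"], ["red", "large"]] * 5,
    dtype=object,
)
y_train = np.array(["a", "b", "a", "b"] * 5)

# Training side: fit every step once, then persist the fitted pipeline.
pipeline = Pipeline([
    # Assumption: unseen categories are encoded as all zeros instead of raising.
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    # with_mean=False lets the scaler accept the encoder's sparse output.
    ("scale", StandardScaler(with_mean=False)),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X_train, y_train)
blob = pickle.dumps(pipeline)  # stands in for writing a .pkl file

# Serving side: load and call predict(); no fit() or fit_transform() here.
loaded = pickle.loads(blob)
x_new = np.array([["green", "small"]], dtype=object)

# The fitted encoder still produces the training width (3 + 2 = 5 columns).
n_encoded = loaded.named_steps["onehot"].transform(x_new).shape[1]
y_pred = loaded.predict(x_new)
print(n_encoded, y_pred)
```

If the separately pickled objects are kept instead, the same idea applies: replace each `fit_transform(...)` call in the prediction code with `transform(...)` on the loaded object.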