Getting correct shape for datapoint to predict with a Regression model after using One-Hot-Encoding in training

Question

I am writing an application that uses linear regression, specifically sklearn.linear_model.Ridge. I have trouble bringing the datapoint I would like to predict into the correct shape for Ridge. I will briefly describe my two applications and how the problem shows up:

1ST APPLICATION:

My datapoints have just one feature each, which is always a string, so I am using One-Hot-Encoding to be able to use them with Ridge. After that, the datapoints (X_hotEncoded) have 9 features each:

import pandas as pd
X_hotEncoded = pd.get_dummies(X)

After fitting Ridge to X_hotEncoded and the labels y, I save the trained model with:

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(ridge, "ridge.pkl")

2ND APPLICATION:

Now that I have a trained model saved on disk, I would like to load it in my 2nd application and predict y (the label) for just one datapoint. That is where I run into the problem mentioned above:

# X = the one datapoint I would like to predict y for
ridge = joblib.load("ridge.pkl")
X_hotEncoded = pd.get_dummies(X)
ridge.predict(X_hotEncoded)  # this should give me the prediction

This gives me the following error in the last line of code:

ValueError: shapes (1,1) and (9,) not aligned: 1 (dim 1) != 9 (dim 0)

Ridge was trained with 9 features because I applied One-Hot-Encoding to all the datapoints. Now, when I want to predict just one datapoint (with just 1 feature), I have trouble bringing this datapoint into the correct shape for Ridge to handle it. One-Hot-Encoding has no effect on just one datapoint with just one feature: it yields a single column instead of 9.

Does anybody know a neat solution to this problem?

A possible solution might be to write the column names to disk in the 1st application, retrieve them in the 2nd, and then rebuild the datapoint there. The column names of one-hot-encoded arrays could be retrieved as described here: Reversing 'one-hot' encoding in Pandas
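A minimal sketch of that column-name idea, assuming a hypothetical single string feature called "color" with three categories (the real data has 9). The stored column list is used to reindex the encoded datapoint so it has the same layout as the training data:

```python
import pandas as pd

# Hypothetical training data: one string feature with three categories.
X_train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
X_train_enc = pd.get_dummies(X_train)

# Remember the training column layout (this list could be written to disk).
train_columns = list(X_train_enc.columns)

# Encoded on its own, a single new datapoint yields only one column ...
X_new = pd.DataFrame({"color": ["green"]})
X_new_enc = pd.get_dummies(X_new)

# ... so align it with the training columns, filling the missing ones with 0.
X_new_aligned = X_new_enc.reindex(columns=train_columns, fill_value=0)
```

Note that a category not seen during training would produce a column that reindex silently drops, so the datapoint would end up as all zeros.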

Toterich Toterich · Accepted Answer · 2017-07-10T15:03:42

What happens here is the following:

During the training phase, you decided on an encoding that transforms a single categorical feature into 9 numerical ones (One Hot). You trained your regression algorithm on this encoding. So in order to use it on unknown (test) data, you have to transform this data in exactly the same way as you did during training.

Unfortunately, I don't think you can save the encoding used by pd.get_dummies and reuse it. You should use sklearn.preprocessing.OneHotEncoder() instead. So during training:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
X_hotEncoded = enc.fit_transform(X)

fit_transform() first fits the encoder to your training data and then uses it to transform the data. The difference from pd.get_dummies() is that you now have an encoder object which you can save and reuse later:

joblib.dump(enc, "encoder.pkl")

During testing you can apply the same encoding used during training like this:

enc = joblib.load("encoder.pkl")
X_hotEncoded = enc.transform(X)

Note that you don't want to fit the encoder again (this is what pd.get_dummies() would do) because it is crucial that the same encoding is used for the training and test data.
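Put together, the two applications might look roughly like the sketch below. The toy data, the variable names and the temporary directory are assumptions made for illustration; only the OneHotEncoder, Ridge and joblib calls come from the discussion above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one string feature, shape (n_samples, 1).
X_train = np.array([["red"], ["green"], ["blue"], ["green"]])
y_train = np.array([1.0, 2.0, 3.0, 2.5])

# --- 1st application: fit encoder and model, then save both ---
enc = OneHotEncoder()
ridge = Ridge().fit(enc.fit_transform(X_train), y_train)

tmp_dir = tempfile.mkdtemp()  # stand-in for wherever the files are kept
enc_path = os.path.join(tmp_dir, "encoder.pkl")
ridge_path = os.path.join(tmp_dir, "ridge.pkl")
joblib.dump(enc, enc_path)
joblib.dump(ridge, ridge_path)

# --- 2nd application: load both, transform only (no fitting) ---
enc_loaded = joblib.load(enc_path)
ridge_loaded = joblib.load(ridge_path)

X_new = np.array([["green"]])  # a single datapoint, still 2-D
X_new_enc = enc_loaded.transform(X_new)  # same number of columns as in training
prediction = ridge_loaded.predict(X_new_enc)
```

The single datapoint is kept two-dimensional (one row, one column) because the encoder expects the same input layout it saw during fitting.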

Watch out:

You will run into problems if the test data contains values which were not present in the training data (because then the encoder does not know how to encode these unknown values). To avoid this, you can either:

provide OneHotEncoder() with the categories argument, passing it a list of all your categories (one list per input feature).
provide OneHotEncoder() with handle_unknown="ignore". This avoids the error and encodes an unknown value as all zeros.
perform One Hot Encoding before splitting the data into training and test set.
provide OneHotEncoder() with the n_values argument telling the encoder how many different categories to expect for each input feature [edit: deprecated since version 0.20 and removed in later releases, so prefer categories].
