1
votes

I'm trying to learn machine learning.

I have a question about one-hot encoding:
I have a data set split into two Excel sheets: one sheet has the train data and the other has the test data. I first trained my model by importing the train data sheet with pandas. There are categorical features in the data set that have to be encoded, so I one-hot encoded them.

After importing the test data set, if I one-hot encode it, will the encoding be the same as for the train data set, or will it be different? If it is different, how can I solve this issue?

2
How did you perform one-hot encoding? Manually or by using sklearn? - Sreeram TP
If your train/test sets contain different values in the categorical column that you are one-hot encoding, then you will get different columns returned. IMO, your options are to either encode train/test together, or write a function to add the appropriate dummy columns to your one-hot encoded train/test sets. - Stev
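To illustrate the second option from the comment above, here is a minimal sketch using pandas. The toy data and the column name `color` are hypothetical, not taken from the question. The idea is to encode each set separately and then reindex the test columns against the train columns, so that missing dummy columns are added as zeros and categories seen only in test are dropped.

```python
import pandas as pd

# Hypothetical toy data; "color" is an illustrative column name only.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "yellow"]})  # "yellow" never appears in train

# Encode each set on its own; the resulting columns differ between the two.
train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)
test_encoded = pd.get_dummies(test, columns=["color"], dtype=int)

# Align test to train: add the dummy columns that are missing (filled with 0)
# and drop the columns for categories the model was never trained on.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_aligned)
```

After alignment, `test_aligned` has exactly the same columns, in the same order, as `train_encoded`, which is what a model trained on `train_encoded` expects.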

2 Answers

1
votes

One-hot encoding creates one binary attribute per category (per distinct value). For each row, the attribute that matches its category is equal to 1 (hot), while the others are 0 (cold).

Sample example:

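from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# df_object is assumed to be a 1-D NumPy array of category values;
# reshape(-1, 1) turns it into the single-column 2-D input the encoder expects
one_hot = encoder.fit_transform(df_object.reshape(-1, 1))
# fit_transform returns a sparse matrix by default; convert it to view the values
one_hot.toarray()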

Sample output:

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

You need to check whether the values of the attribute you are one-hot encoding are relatively close to each other or not.

0
votes

You have two separate sheets (for the test and train data sets). You have to one-hot encode both sheets separately after importing them into pandas data frames.

And yes, one-hot encoding produces the same result for the same set of categories, no matter which data sheet you apply it to. Make sure that the column has the same categorical values in each of your data sheets.
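If the two sheets might not contain exactly the same categorical values, one way to keep the encoding consistent is to fit the encoder on the train data only and reuse it for the test data. The following is a minimal sketch with scikit-learn's `OneHotEncoder`, not a definitive implementation. The toy data and the column name `color` are hypothetical, and `handle_unknown="ignore"` is a choice made here so that categories unseen during fitting are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "color" is an illustrative column name only.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "yellow"]})  # "yellow" is not in train

# Learn the categories from the train data only.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["color"]]).toarray()

# Reuse the fitted encoder on test: same columns, same order as for train.
# The unseen category "yellow" becomes a row of zeros.
test_encoded = encoder.transform(test[["color"]]).toarray()

print(encoder.categories_)
print(test_encoded)
```

Because the test data only goes through `transform`, never `fit`, its encoded columns always match the ones the model was trained on.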