
I am trying this with a sample DataFrame:

import pandas as pd

data = [['Alex', 'USA', 0], ['Bob', 'India', 1], ['Clarke', 'SriLanka', 0]]

df = pd.DataFrame(data, columns=['Name', 'Country', 'Target'])

Now from here, I used get_dummies to convert the string columns to indicator (0/1) columns:

column_names = ['Name', 'Country']

one_hot = pd.get_dummies(df[column_names])

After conversion the columns are: Name_Alex, Name_Bob, Name_Clarke, Country_India, Country_SriLanka, Country_USA
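Put together, the conversion can be checked end to end; pd.get_dummies names each new column `<column>_<value>`:

```python
import pandas as pd

# Sample data from the question (target column spelled 'Target' here)
data = [['Alex', 'USA', 0], ['Bob', 'India', 1], ['Clarke', 'SriLanka', 0]]
df = pd.DataFrame(data, columns=['Name', 'Country', 'Target'])

# One-hot encode only the string columns
one_hot = pd.get_dummies(df[['Name', 'Country']])
print(list(one_hot.columns))
```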

Slicing the data.

x = one_hot[["Name_Alex", "Name_Bob", "Name_Clarke", "Country_India", "Country_SriLanka", "Country_USA"]].values

y = df['Target'].values

Splitting the dataset into train and test:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.5, random_state=0)

Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(x_train, y_train)

Now the model is trained.

For prediction, let's say I want to predict the "Target" by giving "Name" and "Country",
like ["Alex", "USA"].

Prediction.

If I use this:

logreg.predict([["Alex", "USA"]])

obviously it will not work, since the model was trained on the six one-hot columns, not on the raw strings.

Question 1) How do I test a prediction after applying one-hot encoding during training?

Question 2) How do I predict on a sample csv file which contains only "Name" and "Country"?

First, you should only post one question per thread. Second, your question is best suited for Data Science Stack Exchange. – IMCoins
Your model expects array inputs that correspond to ["Name_Alex", "Name_Bob", "Name_Clarke", "Country_India", "Country_SriLanka", "Country_USA"]. You will have to read your sample csv file, shape it into an array of that form, then call logreg.predict(my_array). – Karl
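Karl's suggestion can be sketched with pd.get_dummies plus reindex; the column list is the one from the question, and fill_value=0 zeroes out any category absent from the sample:

```python
import pandas as pd

# Columns produced by pd.get_dummies on the training data (from the question)
train_cols = ['Name_Alex', 'Name_Bob', 'Name_Clarke',
              'Country_India', 'Country_SriLanka', 'Country_USA']

# A sample with only Name and Country, as it would come from the csv
sample = pd.DataFrame([['Alex', 'USA']], columns=['Name', 'Country'])

# One-hot encode the sample, then align it with the training columns;
# categories the sample does not contain become 0 via fill_value
encoded = pd.get_dummies(sample).reindex(columns=train_cols, fill_value=0).astype(int)
print(encoded.values)  # ready for logreg.predict(encoded.values)
```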

1 Answer


I suggest you use sklearn's LabelEncoder and OneHotEncoder classes instead of pd.get_dummies.

Once you fit a label encoder and a one-hot encoder per feature, save them somewhere, so that when you want to predict on new data you can load the saved encoders and encode your features again.

This way you encode the features at prediction time in exactly the same way as you did when building the training set.

Below is the code which I use for saving encoders:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_dict = {}
onehotencoder_dict = {}
X_train = None
for i in range(X.shape[1]):
    # Fit a label encoder for column i and keep it for reuse at prediction time
    label_encoder = LabelEncoder()
    labelencoder_dict[i] = label_encoder
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    # One-hot encode the integer labels (use sparse_output=False on scikit-learn >= 1.2)
    onehot_encoder = OneHotEncoder(sparse=False)
    feature = onehot_encoder.fit_transform(feature)
    onehotencoder_dict[i] = onehot_encoder
    if X_train is None:
        X_train = feature
    else:
        X_train = np.concatenate((X_train, feature), axis=1)

Now I save this onehotencoder_dict and labelencoder_dict and use them later for encoding new data.
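The answer does not show how the dictionaries are persisted; one common approach is pickle (the filename `encoders.pkl` and the placeholder dicts below are made up for the example):

```python
import pickle
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Placeholder dicts standing in for the ones fitted above
labelencoder_dict = {0: LabelEncoder(), 1: LabelEncoder()}
onehotencoder_dict = {0: OneHotEncoder(), 1: OneHotEncoder()}

# Persist both dicts so the prediction code can reuse the exact same encoders
with open('encoders.pkl', 'wb') as f:
    pickle.dump((labelencoder_dict, onehotencoder_dict), f)

# Later, before encoding new data:
with open('encoders.pkl', 'rb') as f:
    labelencoder_dict, onehotencoder_dict = pickle.load(f)
```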

def getEncoded(test_data, labelencoder_dict, onehotencoder_dict):
    test_encoded_x = None
    for i in range(test_data.shape[1]):
        # Reuse the encoders fitted on the training data; transform, not fit_transform
        label_encoder = labelencoder_dict[i]
        feature = label_encoder.transform(test_data[:, i])
        feature = feature.reshape(test_data.shape[0], 1)
        onehot_encoder = onehotencoder_dict[i]
        feature = onehot_encoder.transform(feature)
        if test_encoded_x is None:
            test_encoded_x = feature
        else:
            test_encoded_x = np.concatenate((test_encoded_x, feature), axis=1)
    return test_encoded_x
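Putting the two snippets together on the question's data, the whole encode/train/predict round trip might look like this (a sketch; `.toarray()` is used so it works whether OneHotEncoder returns sparse or dense output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def get_encoded(data, labelencoder_dict, onehotencoder_dict):
    # Same logic as getEncoded above, reusing the fitted encoders
    encoded = None
    for i in range(data.shape[1]):
        feature = labelencoder_dict[i].transform(data[:, i]).reshape(data.shape[0], 1)
        feature = onehotencoder_dict[i].transform(feature).toarray()
        encoded = feature if encoded is None else np.concatenate((encoded, feature), axis=1)
    return encoded

# Training data from the question
X = np.array([['Alex', 'USA'], ['Bob', 'India'], ['Clarke', 'SriLanka']])
y = np.array([0, 1, 0])

# Fit one LabelEncoder + OneHotEncoder per column and encode the training data
labelencoder_dict, onehotencoder_dict = {}, {}
X_train = None
for i in range(X.shape[1]):
    le = LabelEncoder()
    labelencoder_dict[i] = le
    feature = le.fit_transform(X[:, i]).reshape(X.shape[0], 1)
    ohe = OneHotEncoder()
    onehotencoder_dict[i] = ohe
    feature = ohe.fit_transform(feature).toarray()
    X_train = feature if X_train is None else np.concatenate((X_train, feature), axis=1)

logreg = LogisticRegression()
logreg.fit(X_train, y)

# Encode a new sample exactly like the training data, then predict
encoded = get_encoded(np.array([['Alex', 'USA']]), labelencoder_dict, onehotencoder_dict)
print(logreg.predict(encoded))
```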