I'm trying to pre-process some data by one-hot encoding a categorical column with the OneHotEncoder from the sklearn library.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categories =X[:,1].reshape(-1,1))
If all is well, I should be able to encode the data by
X = onehotencoder.fit_transform(X).toarray()
(right?) but this raises this peculiar error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Is there something elementary I am doing wrong? I checked the documentation at https://scikit-learn.org/0.20/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder , but that didn't help me.
I am also really curious why this error is popping up. I have looked it up, but I don't understand what it means in this context. Please let me know if I need to provide more information.
To be clear on the data set: I have ten columns, of which I only want to one-hot encode the categorical column holding the countries (there are three: France, Germany and Spain). The rest of the columns hold numerical values.
One thing I am wondering is whether the categories argument should receive the entire column one wishes to encode, or just an array of the distinct values. So instead of
onehotencoder = OneHotEncoder(categories =X[:,1].reshape(-1,1))
Should one do something like
onehotencoder = OneHotEncoder(categories = np.array(['France','Germany','Spain']).reshape(-1,1))
?
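For reference, here is a minimal sketch of the second form. As far as I understand the documentation, categories expects a list with one array of distinct values per input column, not the column itself and not a reshaped (n, 1) array. The countries array below is a made-up stand-in for the country column, not the real Churn_Modelling.csv data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the country column, shaped (n_samples, 1)
countries = np.array(['France', 'Spain', 'Germany', 'France'],
                     dtype=object).reshape(-1, 1)

# categories: a list holding one sequence of distinct values per input column
encoder = OneHotEncoder(categories=[['France', 'Germany', 'Spain']])

# The encoder returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(countries).toarray()
print(encoded)
```

Each output column corresponds to one entry of the categories list, in the order given. Leaving categories at its default 'auto' should let the encoder infer the same values from the data.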
Last edit: I am just trying to find a 'quick' way of one-hot encoding one specific column within the whole data set.
I am aware I could always take out the column I want to encode, run the simple code on that column, and then insert the result back into the dataframe. I was just hoping to find some code that could be applied to many different situations with minimal editing.
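One candidate for such reusable code is ColumnTransformer from sklearn.compose (available from version 0.20 on), which applies an encoder to selected columns and passes the others through. The sketch below assumes that approach and uses a tiny made-up array X_demo instead of the real data set; only the column index list would need editing for other cases.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-column sample: a numeric column, the country, a numeric column
X_demo = np.array([
    [619, 'France', 42],
    [608, 'Spain', 41],
    [502, 'Germany', 30],
], dtype=object)

# One-hot encode column 1 only and pass the remaining columns through unchanged
ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [1])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct.fit_transform(X_demo)
print(X_encoded)
```

The encoded columns come first in the result, followed by the passed-through columns in their original order.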
Edit: added a screenshot of the dataset (image: Example of the data set).