I'm trying to pre-process some data by one-hot encoding a categorical column with the OneHotEncoder from the sklearn library.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(categories =X[:,1].reshape(-1,1))

If all is well, I should be able to encode the data by

X = onehotencoder.fit_transform(X).toarray()

(right?) but this raises this peculiar error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Is there something elementary I am doing wrong? I checked the documentation https://scikit-learn.org/0.20/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder , but it did not help me.

I am also really curious why this particular error is raised. I have looked it up, but I don't understand what it means in this context. Please let me know if I need to provide more information.

To be clear about the data set: I have ten columns, and I only want to one-hot encode the categorical column holding the countries (there are three: France, Germany and Spain). The rest of the columns hold numerical values.

One thing I am wondering about is the categories argument: should one pass the entire column that is to be encoded, or just an array of the distinct values? So instead of

onehotencoder = OneHotEncoder(categories =X[:,1].reshape(-1,1))

Should one do something like

onehotencoder = OneHotEncoder(categories = np.array(['France','Germany','Spain']).reshape(-1,1))

?

Last edit: I am just trying to find a 'quick' way of one-hot encoding one specific column within the whole data set.

I am aware I could always take out the column I want to encode, run the simple code on that column, and then insert the result back into the dataframe. I was just hoping to find some code that could be applied to many different situations with minimal editing.

Edit: added a screenshot of the dataset. [Image: Example of the data set]

Just remove the .toarray() from fit_transform. - Twinkle Patel
Thanks for the reply. However, even if I remove that, I run into the same error. - Maurits van Roozendaal
Can you provide a sample of your data? Do you have missing values in it? - Catalina Chircu
Why did you fill the categories argument in OneHotEncoder? If you need to encode all the categories it will do it automatically. Which categories do you want to transform, exactly? - Catalina Chircu
Hi, I have a range of numerical values and the category gender, which I can encode with a label encoder and which doesn't need one-hot encoding. The only category I want to one-hot encode is 'Geography', which can only take the values France, Germany and Spain. I fill in the categories argument because if I leave it on 'auto', it seems to encode all the values, including the numerical ones, which is of course not what I want. I do manage to one-hot encode the geography by slicing out this column and encoding it on its own, but then it gets messy to insert the result into the dataset again... I'll provide an example of the dataset. - Maurits van Roozendaal

1 Answer

There is a problem here. When you do:

X = dataset.iloc[:, 3:13].values

you obtain a numpy array, without the column names.
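
To illustrate the point about losing the column names, here is a minimal sketch. The three-row DataFrame is made-up data standing in for Churn_Modelling.csv; only the column name 'Geography' comes from the question, the other names are assumptions.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset
df = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [42, 41, 42],
})

# .values returns a plain NumPy array: the column labels are gone,
# and the dtype becomes object because numbers and strings are mixed
X = df.iloc[:, 0:3].values
print(type(X).__name__, X.dtype)

# The country column can now only be addressed by position
print(X[:, 1])

# Keeping the DataFrame keeps the label
print(df["Geography"].tolist())
```

Staying with the DataFrame for as long as possible makes it easier to say which column should be encoded.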

Then you do:

onehotencoder = OneHotEncoder(categories =X[:,1].reshape(-1,1))

This passes every value of the second column as categories. The encoder only needs the unique values of that column, so you should do:

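# categories expects a list with one array of unique values per encoded column
categories = [np.unique(X[:, 1])]
# fit this encoder on the country column only, e.g. X[:, [1]]
onehotencoder = OneHotEncoder(categories=categories, handle_unknown='ignore')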

I also added the handle_unknown argument, in case your matrix contains values that are not in categories.
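
For reference, scikit-learn documents categories as a list with one array of expected values per input column. Passing a bare NumPy array instead is a likely source of the "truth value of an array ... is ambiguous" error, presumably because the array ends up being compared against the default string 'auto'. A minimal sketch for a single column, using a made-up geo array in place of the real Geography column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Geography column, shaped (n_samples, 1)
geo = np.array(["France", "Spain", "Germany", "France"], dtype=object).reshape(-1, 1)

# One inner list per input column; here there is exactly one column
encoder = OneHotEncoder(
    categories=[["France", "Germany", "Spain"]],
    handle_unknown="ignore",
)

# fit_transform returns a sparse matrix by default, hence .toarray()
encoded = encoder.fit_transform(geo).toarray()
print(encoded)

# A value outside the listed categories becomes an all-zero row
unseen = encoder.transform(np.array([["Italy"]], dtype=object)).toarray()
print(unseen)
```

The encoder here is fitted on the country column alone, so the number of entries in categories matches the number of input columns.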

Start by trying this.
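
If the goal is to encode one column and leave the rest untouched without slicing and re-inserting by hand, one option (not part of the suggestion above) is ColumnTransformer from sklearn.compose, which is available from scikit-learn 0.20. The sketch below uses a made-up three-row DataFrame; apart from 'Geography', the column names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature stand-in for the real dataset
df = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [42, 41, 42],
})

ct = ColumnTransformer(
    # Encode only the Geography column
    transformers=[("geo", OneHotEncoder(handle_unknown="ignore"), ["Geography"])],
    # Pass every other column through unchanged
    remainder="passthrough",
    # Always return a dense array rather than a sparse matrix
    sparse_threshold=0,
)

# Encoded columns come first (France, Germany, Spain), then the remaining columns
X = ct.fit_transform(df)
print(X)
```

Only the column list needs to change to reuse this on another dataset, which is roughly the "minimal editing" the question asks for.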