1
votes

I have one column in a CSV file containing the names of fruits, which I want to convert into an array.

Sample csv column:

Names:
Apple
Banana
Pear
Watermelon
Jackfruit
..
..
..

There are around 400 fruit names in the column.

I have used one-hot encoding on it, but I am unable to display the column names (each fruit name from a row of the CSV column).

My code so far is:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv('D:/fruits.csv')
X= dataset.iloc[:, 0].values


labelencoder_X = LabelEncoder()
D= labelencoder_X.fit_transform(X)
D = D.reshape(-1, 1)

onehotencoder = OneHotEncoder(sparse=False, categorical_features = [0])
X = onehotencoder.fit_transform(D)

This converts the column's data into a NumPy array, but the column names come out as [0 1 2 3 .. ..], whereas I want each fruit name from the CSV as a column name, for example [Apple Banana Pear Watermelon .. ..].

How can I retain the column names after one-hot encoding?

Can you add your current output and desired output to the question? - Furqan Hashim
.values changes the dataframe to a NumPy array, which doesn't support string column names. You can try X = pd.DataFrame(X, columns = dataset.columns) - Sachin Prabhu
@SachinPrabhu I am getting the error "ValueError: Shape of passed values is (1, 68197), indices imply (3, 68197)" - Lalit
Does this answer your question? Feature names from OneHotEncoder - Ben Reiniger

1 Answer

2
votes

Original Answer:

A rather efficient way to one-hot encode is to use pd.get_dummies. I've applied it to sample data:

import pandas as pd

data = {'Names': ['Apple', 'Banana', 'Pear', 'Watermelon']}
df = pd.DataFrame(data=data)

# Each new column is named <original column>_<value>
df_new = pd.get_dummies(df)
print(df_new)

Original df:

        Names
0       Apple
1      Banana
2        Pear
3  Watermelon

Encoded df:

   Names_Apple  Names_Banana  Names_Pear  Names_Watermelon
0            1             0           0                 0
1            0             1           0                 0
2            0             0           1                 0
3            0             0           0                 1
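Applied to the single-column CSV from the question, the same idea might look like the sketch below. It assumes the column header is Names and uses an in-memory string as a stand-in for D:/fruits.csv; passing a Series (rather than the whole DataFrame) to pd.get_dummies leaves the new columns unprefixed, so they are the bare fruit names. The dtype=int argument is there because newer pandas versions may otherwise return booleans instead of 0/1.

```python
import io
import pandas as pd

# Stand-in for the real file (assumption); in practice: pd.read_csv('D:/fruits.csv')
csv_text = "Names\nApple\nBanana\nPear\nWatermelon\n"
dataset = pd.read_csv(io.StringIO(csv_text))

# Encoding a Series gives columns named after the values, with no prefix
encoded = pd.get_dummies(dataset['Names'], dtype=int)

column_names = encoded.columns.tolist()  # fruit names, in sorted order
X = encoded.to_numpy()                   # the one-hot data as a NumPy array

print(column_names)
print(X)
```

Keeping the result as a DataFrame (encoded) retains the names alongside the data; the NumPy array itself cannot carry column labels, so column_names has to travel with it separately.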

Edit:

Let's assume that our dataframe contains 2 categorical and 2 numeric features, and we want to one-hot encode only 1 of the 2 categorical columns.

Generating dummy data:

data = {'Names': ['Apple', 'Banana', 'Pear', 'Watermelon'],
        'Category': ['A', 'B', 'A', 'B'],
        'Val1': [10, 20, 30, 30],
        'Val2': [60, 70, 80, 90]}
df = pd.DataFrame(data=data)

        Names Category  Val1  Val2
0       Apple        A    10    60
1      Banana        B    20    70
2        Pear        A    30    80
3  Watermelon        B    30    90

If we only want to one-hot encode Names, we can do that with:

df_new = pd.get_dummies(df, columns=['Names'])
print(df_new)

See the pandas get_dummies documentation for details. Passing columns restricts the encoding to the columns of interest; all other columns are left unchanged.

Encoded Output:

  Category  Val1  Val2  Names_Apple  Names_Banana  Names_Pear  Names_Watermelon
0        A    10    60            1             0           0                 0
1        B    20    70            0             1           0                 0
2        A    30    80            0             0           1                 0
3        B    30    90            0             0           0                 1
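If you would rather stay with scikit-learn's OneHotEncoder, as in the question, the names can be recovered from the fitted encoder. This is only a sketch on the sample data, not part of the original answer, and it assumes a scikit-learn version (0.20 or later) in which OneHotEncoder accepts string columns directly and exposes the categories_ attribute, so the LabelEncoder step is not needed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Names': ['Apple', 'Banana', 'Pear', 'Watermelon']})

# Fit on a 2-D selection (double brackets) so the encoder sees one feature
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['Names']]).toarray()  # sparse -> dense array

# categories_ holds one array of learned values per encoded feature
fruit_names = list(encoder.categories_[0])

# Reattach the names by wrapping the array in a DataFrame
df_encoded = pd.DataFrame(encoded, columns=fruit_names, index=df.index)
print(df_encoded)
```

The default encoder output is a sparse matrix, which is why .toarray() is called here; for around 400 categories the dense result is still small, but for much larger columns it may be worth keeping the sparse form.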