How to convert a huge set of categorical data from strings into numerical values automatically?

Question

I am trying to build a decision tree regression model to predict the MSRP (Manufacturer's Suggested Retail Price) of cars. However, I'm having trouble converting the categorical values into numerical values.

My problem: I have 8 categorical feature columns, some with up to 40 unique values, and 20,000 instances. What method should I use to convert the categorical data for use in decision tree regression? And is there a way to encode the unique values automatically instead of entering them manually?

I tried using LabelEncoder to convert the categorical values, but for some reason the first column of df.values (BMW, Acura, ...) didn't change even after I transformed it.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
df = pd.read_excel(r'C:\Users\user\Desktop\data.xlsx')
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df.values[:, 0] = labelencoder.fit_transform(df.values[:, 0])

This is the result I got:

array([['BMW', '1 Series M', 2011, ..., 19, 3916, 46135],
       ['BMW', '1 Series', 2011, ..., 19, 3916, 40650],
       ['BMW', '1 Series', 2011, ..., 20, 3916, 36350],
       ...,
       ['Acura', 'ZDX', 2012, ..., 16, 204, 50620],
       ['Acura', 'ZDX', 2013, ..., 16, 204, 50920],
       ['Lincoln', 'Zephyr', 2006, ..., 17, 61, 28995]], dtype=object)

I want the first column to be numerical so that it can be used for DT regression. Can anyone help? I'm doing this for my FYP, and this is my first time approaching machine learning.

Hima Hima · Accepted Answer · 2019-01-11T08:33:30

There are multiple ways to convert categorical data to numeric using pandas and sklearn:

1. pandas.get_dummies() (one-hot encoding)
Example:

import numpy as np
import pandas as pd

df = pd.DataFrame([['BMW', '1 Series M', 2011, 19, 3916, 46135],
       ['BMW', '1 Series', 2011,19, 3916, 40650],
       ['BMW', '1 Series', 2011,20, 3916, 36350],
       ['Acura', 'ZDX', 2012, 16, 204, 50620],
       ['Acura', 'ZDX', 2013, 16, 204, 50920],
       ['Lincoln', 'Zephyr', 2006, 17, 61, 28995]]) #Sample dataframe

pd.get_dummies(df, columns = [0,1,2]) #Dummies of 1st,2nd and 3rd column
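On the "automatically" part of the question: when no columns argument is passed, pd.get_dummies() encodes every object, string, or category column on its own and leaves numeric columns untouched. Below is a minimal sketch of that behaviour. The column names (Make, Model, Year, MSRP) and the trimmed sample rows are assumptions for illustration, not the asker's actual spreadsheet layout.

```python
import pandas as pd

# Hypothetical sample: column names are assumed for illustration only.
df = pd.DataFrame(
    [['BMW', '1 Series M', 2011, 46135],
     ['BMW', '1 Series', 2011, 40650],
     ['Acura', 'ZDX', 2012, 50620],
     ['Lincoln', 'Zephyr', 2006, 28995]],
    columns=['Make', 'Model', 'Year', 'MSRP'],
)

# No `columns` argument: every non-numeric (string/category) column is
# one-hot encoded, and the numeric columns pass through unchanged.
encoded = pd.get_dummies(df)

# Split into features and target for a regressor.
X = encoded.drop(columns='MSRP')
y = encoded['MSRP']

print(encoded.columns.tolist())
```

With 8 categorical columns of up to 40 unique values each, this could produce a few hundred indicator columns, which is usually still manageable for 20,000 rows.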


2. LabelEncoder
Example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['BMW', '1 Series M', 2011, 19, 3916, 46135],
       ['BMW', '1 Series', 2011,19, 3916, 40650],
       ['BMW', '1 Series', 2011,20, 3916, 36350],
       ['Acura', 'ZDX', 2012, 16, 204, 50620],
       ['Acura', 'ZDX', 2013, 16, 204, 50920],
       ['Lincoln', 'Zephyr', 2006, 17, 61, 28995]]) #Sample dataframe

df[[0,1,2]].apply(LabelEncoder().fit_transform)

This returns only the transformed columns, which need to be put back into the original dataframe:

df.loc[0:, 0:2] = df[[0, 1, 2]].apply(LabelEncoder().fit_transform)
# puts the encoded columns back into the dataframe
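As for why the attempt in the question had no effect: for a dataframe with mixed column types, df.values builds a new NumPy array, so writing into it changes only that temporary copy and never the dataframe. Assigning to the dataframe's own columns avoids this. The sketch below also selects the non-numeric columns automatically rather than listing them by hand. The column names and the encoders dictionary are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample: column names are assumed for illustration only.
df = pd.DataFrame(
    [['BMW', '1 Series M', 2011, 46135],
     ['BMW', '1 Series', 2011, 40650],
     ['Acura', 'ZDX', 2012, 50620],
     ['Lincoln', 'Zephyr', 2006, 28995]],
    columns=['Make', 'Model', 'Year', 'MSRP'],
)

# Pick up every non-numeric column without naming it explicitly.
cat_cols = df.select_dtypes(exclude='number').columns

# Keep one fitted encoder per column so the codes can be mapped back later.
encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    # Assign to the DataFrame column itself, not to df.values.
    df[col] = le.fit_transform(df[col])
    encoders[col] = le

print(df)
print(encoders['Make'].inverse_transform([1]))  # recover the label for code 1
```

Note that integer codes follow the sorted order of the labels, which carries no real meaning for car makes; a tree can still split on them, but one-hot encoding avoids that implied ordering. scikit-learn's documentation describes LabelEncoder as intended for target values and offers OrdinalEncoder for feature columns, so that may be worth considering as well.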

