I am trying to build a decision tree regression model to predict the MSRP (Manufacturer Suggested Retail Price) of cars. However, I'm having trouble converting the categorical values into numerical values.
My problem: I have 20,000 instances and 8 columns of categorical features, some of which have up to 40 unique values. What method should I use to convert the categorical data so it can be used for decision tree regression? And is there a way to encode the unique values automatically instead of entering them manually?
I tried using LabelEncoder to convert the categorical values, but for some reason the first column of df.values (BMW, Acura, ...) did not change after I transformed it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

# Load the car data from the spreadsheet
df = pd.read_excel(r'C:\Users\user\Desktop\data.xlsx')

# Attempt to overwrite the first column (the car make) with integer labels
labelencoder = LabelEncoder()
df.values[:, 0] = labelencoder.fit_transform(df.values[:, 0])
This is the result I got:
array([['BMW', '1 Series M', 2011, ..., 19, 3916, 46135],
['BMW', '1 Series', 2011, ..., 19, 3916, 40650],
['BMW', '1 Series', 2011, ..., 20, 3916, 36350],
...,
['Acura', 'ZDX', 2012, ..., 16, 204, 50620],
['Acura', 'ZDX', 2013, ..., 16, 204, 50920],
['Lincoln', 'Zephyr', 2006, ..., 17, 61, 28995]], dtype=object)
I want the first column converted to numerical values so that it can be used for decision tree (DT) regression. Can anyone help? This is for my final year project (FYP), and it is my first time working with machine learning.
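To make the goal concrete, here is a minimal sketch of the kind of automatic conversion I am after, not a confirmed solution. It assumes a small made-up DataFrame (the column names `Make`, `Model`, `Year`, and `MSRP` are hypothetical stand-ins for my real spreadsheet) and uses pandas category codes rather than LabelEncoder. It assigns back to the DataFrame columns instead of to df.values, on the assumption that df.values hands back a separate array for a mixed-type frame.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; the real column names may differ.
df = pd.DataFrame({
    "Make": ["BMW", "BMW", "Acura", "Acura", "Lincoln"],
    "Model": ["1 Series M", "1 Series", "ZDX", "ZDX", "Zephyr"],
    "Year": [2011, 2011, 2012, 2013, 2006],
    "MSRP": [46135, 40650, 50620, 50920, 28995],
})

# Find every non-numeric column automatically instead of listing them by hand.
cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Replace each text column with integer codes, one code per unique value.
# Assigning to df[col] changes the DataFrame itself.
for col in cat_cols:
    df[col] = df[col].astype("category").cat.codes

print(cat_cols)
print(df)
```

Each text column gets integer codes in sorted order of its unique values (Acura, BMW, Lincoln become 0, 1, 2 here), so nothing has to be typed in manually. The codes imply an arbitrary ordering of the categories, which may or may not matter for the tree.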