
I am trying to work on the Titanic survival challenge on Kaggle: https://www.kaggle.com/c/titanic.

I am not experienced in R, so I am using Python and scikit-learn's RandomForestClassifier.

I see many people using scikit-learn convert their categorical variables with many levels into dummy variables.

I don't understand the point of doing this. Why can't we just map the levels to numeric values and be done with it?

I also saw someone do the following: there was a categorical feature Pclass with three levels; he created 3 dummy variables for it and then dropped the dummy for the level with the lowest survival rate. I couldn't understand this either. I thought decision trees didn't care about correlated features. It looked roughly like the sketch below.
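This is my reconstruction of what he did, not his exact code (the file name and the choice of dropped column are assumptions; pandas names the columns Pclass_1, Pclass_2, Pclass_3 from the prefix):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # the Kaggle Titanic training file

# One binary dummy column per Pclass level
pclass_dummies = pd.get_dummies(train["Pclass"], prefix="Pclass")

# He then dropped the dummy for the level with the lowest survival
# rate (third class in the Titanic data)
pclass_dummies = pclass_dummies.drop(columns=["Pclass_3"])
```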


1 Answer


If you just map levels to numeric values, scikit-learn will treat your feature as numeric. That is, numerically 1 < 2 and so on, even if your levels were initially unordered. Think about the "distance" problem: the distance between 1 and 2 is 1, and between 1 and 3 is 2. But what were the original distances between your categories? For example, what are the distances between "banana", "peach", and "apple"? Do you suppose they are all equal? A numeric mapping invents an order and distances that were never there; dummy variables avoid this by giving each level its own binary column.
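Here is a minimal sketch of the difference (the fruit data is just for illustration; note that LabelEncoder assigns codes in alphabetical order):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"fruit": ["banana", "peach", "apple", "peach"]})

# Plain numeric mapping: apple=0, banana=1, peach=2.
# The model now "sees" apple < banana < peach and fake distances.
df["fruit_code"] = LabelEncoder().fit_transform(df["fruit"])

# Dummy variables: one binary column per level, no invented order.
dummies = pd.get_dummies(df["fruit"], prefix="fruit")
print(df.join(dummies))
```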

About dummy variables: if you have 3 levels and create 3 dummy variables, they are not just correlated, they are linearly dependent, since the three columns always sum to 1 (each one equals 1 minus the sum of the other two). This is never good, which is why one of the dummies is typically dropped.
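A minimal sketch of that dependence and the usual fix (pandas' drop_first drops the first level; the person in your example dropped the lowest-survival level instead, which removes the dependence just the same):

```python
import pandas as pd

pclass = pd.Series([1, 2, 3, 1, 3], name="Pclass")

# Three dummies: Pclass_1 + Pclass_2 + Pclass_3 == 1 on every row,
# so the columns are linearly dependent.
full = pd.get_dummies(pclass, prefix="Pclass")
print(full.sum(axis=1).unique())  # [1]

# Dropping one level removes the dependence; the dropped level is
# implicitly encoded as "all remaining dummies are 0".
reduced = pd.get_dummies(pclass, prefix="Pclass", drop_first=True)
print(reduced)
```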