Decision tree with categorical features

Question

I'm implementing a decision tree. Suppose the "race" feature has the following possible values: ['Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'White', 'Other', 'Black']

Suppose the samples in a node have the following values for the "race" feature, and "race" has been selected as the best splitting feature at this node.

['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'White', 'White', 'White', 'Other', 'Black']

Note that the values are grouped together - "sorted".

Suppose the entropy difference (information gain) tells me that the best splitting position is the one marked by the vertical bar "|" below:

['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', | 'White', 'White', 'White', 'Other', 'Black']

What exactly is the split rule, then? It doesn't really make sense to send 'Asian-Pac-Islander' and 'Amer-Indian-Eskimo' left and 'White', 'Other' and 'Black' right, because they are not numbers and have no natural order.

Thanks.

For categorical features, you can just pick one of them to split on. Why cannot you just use the "race" feature that gives you the max information gain? — greeness

Data Science Dojo · Accepted Answer · 2016-04-08T20:19:33

Remember that "left" and "right" children for decision tree nodes are arbitrary labels that humans use for visualization, not inherent mathematical properties of the trees. Flipping the left and right children of any node results in an identical (a mathematician would probably say "isomorphic") tree.

When splitting on a categorical attribute, you usually try every way of partitioning its values into two groups (for k distinct values there are 2^(k-1) - 1 such partitions) and compare the Gini impurity or information gain of each to determine the best split. Once you've established the best split, which group is called the "left" group and which the "right" group is arbitrary, because it doesn't matter.
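As a rough sketch of that search, assuming binary splits scored by entropy-based information gain: the function name `best_categorical_split` is hypothetical (not from any library), and the `labels` below are made up purely for illustration, since the question does not say what the class labels are.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_categorical_split(values, labels):
    """Try every two-group partition of the distinct values.

    Returns (information_gain, left_value_set), or (-1.0, None) if the
    feature has fewer than two distinct values in this node.
    """
    categories = sorted(set(values))
    first, rest = categories[0], categories[1:]
    parent = entropy(labels)
    best_gain, best_left = -1.0, None
    # Pin `first` to the left group so each partition is visited once:
    # 2**(k-1) - 1 candidates for k distinct values.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            left_y = [y for v, y in zip(values, labels) if v in left]
            right_y = [y for v, y in zip(values, labels) if v not in left]
            weighted = (len(left_y) * entropy(left_y)
                        + len(right_y) * entropy(right_y)) / len(labels)
            gain = parent - weighted
            if gain > best_gain:
                best_gain, best_left = gain, left
    return best_gain, best_left


# Feature values from the question; the labels are invented for this example.
values = ['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
          'White', 'White', 'White', 'Other', 'Black']
labels = ['yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes']

gain, left_values = best_categorical_split(values, labels)
print(gain, left_values)
```

Note that the search never looks at the order of `values`; it only looks at which value each sample has. The exhaustive loop grows exponentially with the number of distinct values, so it is only practical for features with few categories.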

It also looks like you're thinking of the split in the literal sense of drawing a dividing line in a specifically ordered list. For categorical attributes, you don't create splits this way. Instead, you define the split condition as, for example, "White, Other, and Black go left; all other Race labels go right". The order of data going into the split node should not affect the resulting split.
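To make that concrete, here is a minimal sketch of such a rule stored as a set of values. The names `LEFT_VALUES`, `goes_left` and `route` are hypothetical, chosen only for this example.

```python
# The split rule is just a set of feature values; everything else goes right.
LEFT_VALUES = frozenset({'White', 'Other', 'Black'})


def goes_left(race):
    """Return True if a sample with this 'race' value is sent to the left child."""
    return race in LEFT_VALUES


def route(samples):
    """Partition a list of 'race' values into (left, right) by set membership."""
    left = [s for s in samples if goes_left(s)]
    right = [s for s in samples if not goes_left(s)]
    return left, right


samples = ['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
           'White', 'White', 'White', 'Other', 'Black']
left, right = route(samples)
print(left)
print(right)
```

Because the rule is a membership test, shuffling `samples` changes only the order within each child, not which samples end up in it. A value never seen during training falls through to the right child in this sketch; how to handle that case is a separate design decision.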
