I'm implementing a decision tree. Suppose the "race" feature has the following possible values: ['Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'White', 'Other', 'Black']
Suppose the samples in a node have the following values for the "race" feature, and "race" has just been selected as the best feature to split on:
['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'White', 'White', 'White', 'Other', 'Black']
Note that identical values are grouped together, i.e. the list is "sorted".
Suppose the entropy difference tells me that the following is the best split position, marked by the vertical bar "|":
['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', | 'White', 'White', 'White', 'Other', 'Black']
What exactly is the split rule, then? It doesn't really make sense to send 'Asian-Pac-Islander' and 'Amer-Indian-Eskimo' left and 'White', 'Other' and 'Black' right, because these values are not numbers and have no natural order.
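To make the question concrete, here is a minimal Python sketch of the only reading I can come up with: treat the bar as partitioning the categories into two sets and route each sample by set membership. The names `split_pos`, `left_categories` and `goes_left` are mine, made up for illustration, and I'm not sure this is the right rule.

```python
# Sorted "race" values of the samples in the node, as in the example above.
values = ['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
          'White', 'White', 'White', 'Other', 'Black']

# Index of the first value to the right of the bar.
split_pos = 3

# Hypothetical rule: every category seen left of the bar goes left,
# every other category goes right.
left_categories = set(values[:split_pos])


def goes_left(race):
    """Return True if a sample with this "race" value is routed to the left child."""
    return race in left_categories


left = [v for v in values if goes_left(v)]
right = [v for v in values if not goes_left(v)]
```

This only works cleanly because identical values are grouped together, so the bar never falls in the middle of one category.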
Thanks.