I'm trying to implement the decision tree algorithm based on the pseudocode below.

However, I don't understand why the first node should be Outlook.

Shouldn't the Gini index of Outlook be 1 - (5/14)^2 - (5/14)^2 - (4/14)^2 = 0.663265306, and the Gini index of Humidity be 1 - (4/14)^2 - (6/14)^2 - (4/14)^2 = 0.653061224?

As the Gini index represents the impurity of the attribute, it seems more reasonable to choose the attribute with the lower Gini index.

Is my way of computing the Gini index wrong, or is there something else I should know?

Data

Rainy     Hot   High    FALSE  No
Rainy     Hot   High    TRUE   No
Overcast  Hot   High    FALSE  Yes
Sunny     Mild  High    FALSE  Yes
Sunny     Cool  Normal  FALSE  Yes
Sunny     Cool  Normal  TRUE   No
Overcast  Cool  Normal  TRUE   Yes
Rainy     Mild  High    FALSE  No
Rainy     Cool  Normal  FALSE  Yes
Sunny     Mild  Normal  FALSE  Yes
Rainy     Mild  Normal  TRUE   Yes
Overcast  Mild  High    TRUE   Yes
Overcast  Hot   Normal  FALSE  Yes
Sunny     Mild  High    TRUE   No

Here is the pseudocode of the decision tree that I'm building

Here is the answer

1 Answer

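The Gini measure, taken here as the sum of the squared class proportions, is a measure of purity. For two classes, its minimum value is 0.5, reached at an equal split. It then increases as the proportion of either class increases, and when it reaches 1 the set is 100% pure in one class or the other. (The more common Gini impurity is 1 minus this quantity, so with that form a lower value is better.)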
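
As a rough illustration of those values, here is a minimal sketch that assumes the purity form of the measure is the sum of squared class proportions. The function name `gini_purity` and the example counts are made up for illustration.

```python
def gini_purity(class_counts):
    """Sum of squared class proportions (assumed purity form of Gini)."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


# Two classes, equal split: the minimum for two classes.
even = gini_purity([7, 7])

# Two classes, everything in one class: fully pure.
pure = gini_purity([14, 0])

# Something in between, e.g. 9 of one class and 5 of the other.
skewed = gini_purity([9, 5])

print(even, pure, skewed)

# The more common Gini impurity is simply 1 minus the purity.
print(1 - even, 1 - pure, 1 - skewed)
```

The equal split gives 0.5 and the single-class set gives 1.0, matching the description above; the skewed set falls between the two.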

The purpose of a decision tree is to maximize the purity of the children. After all, that is how the model "learns" to recognize the classes: by separating them. Hence, with this purity form of the measure, you want the larger value, not the smaller one. By your calculations, this would be Outlook rather than Humidity.
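
To make "purity of the children" concrete, here is a sketch that scores each attribute in the question's data by the size-weighted purity of the subsets it produces. Purity is again assumed to be the sum of squared class proportions. The column names are assumptions, since the data has no header row (only Outlook and Humidity are named in the question), and the helper names are hypothetical. Note that the class labels (Yes/No) inside each child are what get counted, not how often each attribute value occurs.

```python
from collections import Counter, defaultdict

# Assumed column names; the question's data has no header row.
COLUMNS = ["Outlook", "Temperature", "Humidity", "Windy", "Play"]

DATA = """
Rainy Hot High FALSE No
Rainy Hot High TRUE No
Overcast Hot High FALSE Yes
Sunny Mild High FALSE Yes
Sunny Cool Normal FALSE Yes
Sunny Cool Normal TRUE No
Overcast Cool Normal TRUE Yes
Rainy Mild High FALSE No
Rainy Cool Normal FALSE Yes
Sunny Mild Normal FALSE Yes
Rainy Mild Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
Sunny Mild High TRUE No
"""

ROWS = [line.split() for line in DATA.strip().splitlines()]


def label_purity(labels):
    """Sum of squared class proportions for a list of class labels."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_child_purity(rows, col):
    """Split rows on column `col`; average child purity weighted by child size."""
    children = defaultdict(list)
    for row in rows:
        children[row[col]].append(row[-1])  # last column is the class label
    n = len(rows)
    return sum(len(labels) / n * label_purity(labels)
               for labels in children.values())


scores = {name: weighted_child_purity(ROWS, i)
          for i, name in enumerate(COLUMNS[:-1])}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.4f}")
print("best split:", best)
```

Under these assumptions Outlook scores highest (about 0.657, with Humidity next at about 0.633), which agrees with Outlook being the first node.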