2
votes

I need help interpreting the results of J48 in Weka

I don't know how to explain the result. I am using the Heart Disease Data Set from http://archive.ics.uci.edu/ml/datasets/Heart+Disease with the J48 tree.

Please help me with the important points for this analysis. My result is:

=== Run information ===

  • Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
  • Relation: AnaliseCardiaca
  • Instances: 303
  • Attributes: 14
    • age
    • sex
    • cp
    • trestbps
    • chol
    • fbs
    • restecg
    • thalach
    • exang
    • oldpeak
    • slope
    • ca
    • thal
    • num
  • Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree

cp <= 3
|   sex <= 0: 0 (57.0/2.0)
|   sex > 0
|   |   slope <= 1
|   |   |   fbs <= 0
|   |   |   |   trestbps <= 152
|   |   |   |   |   thalach <= 162
|   |   |   |   |   |   ca <= 1
|   |   |   |   |   |   |   age <= 56: 0 (12.0/1.0)
|   |   |   |   |   |   |   age > 56: 1 (3.0/1.0)
|   |   |   |   |   |   ca > 1: 1 (2.0)
|   |   |   |   |   thalach > 162: 0 (27.0)
|   |   |   |   trestbps > 152: 1 (4.0/1.0)
|   |   |   fbs > 0: 0 (9.0)
|   |   slope > 1
|   |   |   slope <= 2
|   |   |   |   ca <= 0
|   |   |   |   |   fbs <= 0
|   |   |   |   |   |   chol <= 261
|   |   |   |   |   |   |   oldpeak <= 2.5: 0 (11.61/1.0)
|   |   |   |   |   |   |   oldpeak > 2.5: 1 (3.0)
|   |   |   |   |   |   chol > 261: 1 (4.0)
|   |   |   |   |   fbs > 0: 0 (4.0)
|   |   |   |   ca > 0
|   |   |   |   |   thal <= 6: 1 (6.0/1.0)
|   |   |   |   |   thal > 6
|   |   |   |   |   |   thalach <= 145: 0 (3.39)
|   |   |   |   |   |   thalach > 145: 1 (5.0/1.0)
|   |   |   slope > 2: 0 (8.0/1.0)
cp > 3
|   thal <= 3
|   |   ca <= 2
|   |   |   exang <= 0
|   |   |   |   sex <= 0
|   |   |   |   |   chol <= 304: 0 (14.0)
|   |   |   |   |   chol > 304: 1 (3.0/1.0)
|   |   |   |   sex > 0
|   |   |   |   |   ca <= 0: 0 (10.0/1.0)
|   |   |   |   |   ca > 0: 1 (3.0)
|   |   |   exang > 0
|   |   |   |   restecg <= 1
|   |   |   |   |   slope <= 1: 0 (2.0)
|   |   |   |   |   slope > 1: 1 (5.37)
|   |   |   |   restecg > 1
|   |   |   |   |   ca <= 0: 0 (4.0)
|   |   |   |   |   ca > 0
|   |   |   |   |   |   ca <= 1
|   |   |   |   |   |   |   thalach <= 113: 0 (2.0)
|   |   |   |   |   |   |   thalach > 113: 1 (4.0)
|   |   |   |   |   |   ca > 1: 0 (2.0)
|   |   ca > 2: 1 (4.0)
|   thal > 3
|   |   fbs <= 0
|   |   |   ca <= 0
|   |   |   |   chol <= 278: 0 (23.0/8.0)
|   |   |   |   chol > 278: 1 (6.0)
|   |   |   ca > 0: 1 (46.0/12.0)
|   |   fbs > 0
|   |   |   ca <= 1: 1 (3.88)
|   |   |   ca > 1: 0 (11.75/4.75)

Number of Leaves : 31

Size of the tree : 61


2 Answers

5
votes

If you are using Weka Explorer, you can right-click the result row in the result list (located on the left of the window, under the Start button) and select "Visualize tree". This will display an image of the tree.

If you still want to understand the results as they are shown in your question:

The results are displayed as a tree. The root of the tree is at the left, and the first feature used is cp. If cp is less than or equal to 3, the next feature in the tree is sex, and so on. You can see that when you split by sex and sex <= 0, you reach a prediction. The prediction is 0, and (57.0/2.0) means that 57 observations in the training set end up on this path and 2 of them were incorrectly classified, i.e. 55 had the label 0 and 2 had the label 1.

Here is what the start of the tree looks like:

                         --------start---------         
                         |                    |
                         |                    |
                         |cp > 3              | cp <= 3
                _________|______          ____|__________
                |              |          |              |
                |thal>3        |thal<=3   |sex>0         |sex<=0
                |              |          |              |
               ...            ...        ...         prediction 0 57(55,2)
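
The leaf notation described above can also be read programmatically. The following is a minimal Python sketch, not part of Weka: `parse_leaf` is a hypothetical helper name, and the regular expression assumes leaves are printed exactly as in the question, i.e. `class (reached)` or `class (reached/misclassified)`.

```python
import re

# Matches the end of a J48 leaf line such as "|   sex <= 0: 0 (57.0/2.0)"
# or "|   |   ca > 1: 1 (2.0)". Assumed format, taken from the output above.
LEAF = re.compile(r":\s*(\S+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")

def parse_leaf(line):
    """Return (predicted_class, reached, misclassified), or None if the line is not a leaf."""
    m = LEAF.search(line)
    if m is None:
        return None
    label = m.group(1)
    reached = float(m.group(2))
    # When no "/x" part is printed, no training instance was misclassified.
    wrong = float(m.group(3)) if m.group(3) else 0.0
    return label, reached, wrong

label, reached, wrong = parse_leaf("|   sex <= 0: 0 (57.0/2.0)")
print(label, reached - wrong, wrong)  # 0 55.0 2.0
```

Lines that only contain a test, such as `|   sex > 0`, are not leaves, so the helper returns `None` for them.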
2
votes

AndreyF's explanation is good. I want to add some information.

Why does the tree have fractional numbers in its leaves? Can an instance (an individual) be split into fractions? (In reality, a person cannot be split.)

When an instance has values for all of its attributes, there is no problem. But when an instance has a missing value for an attribute, the classifier (J48) does not know which branch to follow at a node that tests that attribute.

For example, if an instance is missing its "oldpeak" value, then when it reaches the node that tests "oldpeak" (below the "chol <= 261" branch), the classifier splits the instance according to a probability: one fraction of the instance goes down "oldpeak <= 2.5" and the remaining fraction goes down "oldpeak > 2.5".

How does the classifier calculate that probability? It uses the instances at the current node whose value for the tested attribute is not missing. In this example, that attribute is "oldpeak".

If 25% of the instances with a known "oldpeak" value went down the "oldpeak <= 2.5" branch and 75% went down the "oldpeak > 2.5" branch, then an instance with a missing "oldpeak" value is split so that 25% of it goes through "oldpeak <= 2.5" and the remaining 75% goes through "oldpeak > 2.5".
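
The proportional split described above can be sketched in a few lines of Python. This only illustrates the idea and is not Weka's code: `split_weight` is a hypothetical helper, and the branch counts are made up to reproduce the 25%/75% example.

```python
def split_weight(weight, known_counts):
    """Spread an instance's weight over the branches in proportion to the
    number of instances with a known value that went down each branch."""
    total = sum(known_counts.values())
    return {branch: weight * n / total for branch, n in known_counts.items()}

# Made-up counts: among instances with a known oldpeak at this node,
# 1 went to "oldpeak <= 2.5" and 3 went to "oldpeak > 2.5".
known = {"oldpeak <= 2.5": 1, "oldpeak > 2.5": 3}

parts = split_weight(1.0, known)
print(parts)  # {'oldpeak <= 2.5': 0.25, 'oldpeak > 2.5': 0.75}
```

Each leaf then counts the fraction it received, which is how totals such as 11.61 or 3.39 can appear in the tree.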

You can try removing the instances with missing values; you will see that the tree then shows only whole numbers in its leaves instead of fractional ones.
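
One way to try this outside Weka is to filter the data file before loading it. The sketch below is plain Python and assumes missing values are written as `?`; the two rows are invented for illustration and `drop_missing` is a hypothetical helper name.

```python
def drop_missing(rows, missing="?"):
    """Keep only the rows in which no field equals the missing-value marker."""
    return [row for row in rows if missing not in row]

# Invented example rows (not taken from the real data set).
rows = [
    ["63", "1", "1", "145", "233", "0.0", "6.0", "0"],
    ["52", "1", "3", "172", "199", "?", "7.0", "0"],
]

complete = drop_missing(rows)
print(len(complete))  # 1
```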

Thank you.