I have been using Weka’s J48 decision tree to classify RSS feeds into target categories based on keyword frequencies. I am having trouble reconciling the generated decision tree with the number of correctly classified instances reported in the summary and in the confusion matrix.

For example, one of my .arff files contains the following data extracts:

@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
...
@attribute Keyword_64_fear_Frequency numeric
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}

@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S

And so on: there are 64 keywords (columns) and 570 rows, where each row holds the keyword frequencies for one feed on one day. In this case, there are 57 feeds over 10 days, giving a total of 570 records to be classified. Each keyword is prefixed with a surrogate number and suffixed with ‘Frequency’.

I run the decision tree with default parameters and 10-fold cross-validation.

Weka reports the following:

Correctly Classified Instances         210               36.8421 %
Incorrectly Classified Instances       360               63.1579 %

With the following confusion matrix:

=== Confusion Matrix ===

   a   b   c   d   e   f   g   <-- classified as
  11   0   0   0  39   0   0 |   a = BFE
   0   0   0   0  60   0   0 |   b = FCL
   1   0   5   0  72   0   2 |   c = F
   0   0   1   0  69   0   0 |   d = M
   3   0   0   0 153   0   4 |   e = NCA
   0   0   0   0  90  10   0 |   f = SNT
   0   0   0   0  19   0  31 |   g = S

The tree is as follows:

Keyword_22_health_Frequency <= 0
|   Keyword_7_open_Frequency <= 0
|   |   Keyword_52_libya_Frequency <= 0
|   |   |   Keyword_21_job_Frequency <= 0
|   |   |   |   Keyword_48_pic_Frequency <= 0
|   |   |   |   |   Keyword_63_world_Frequency <= 0
|   |   |   |   |   |   Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
|   |   |   |   |   |   Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
|   |   |   |   |   Keyword_63_world_Frequency > 0
|   |   |   |   |   |   Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
|   |   |   |   |   |   Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
|   |   |   |   Keyword_48_pic_Frequency > 0: F (7.0)
|   |   |   Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
|   |   Keyword_52_libya_Frequency > 0: NCA (31.0)
|   Keyword_7_open_Frequency > 0
|   |   Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
|   |   Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)

My question concerns reconciling the matrix with the tree, or vice versa. As far as I understand the results, an annotation like (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be when the matrix shows only 153? I am not sure how to interpret this, so any help is welcome.

Thanks in advance.

1 Answer

The numbers in parentheses at each leaf should be read as (total number of instances assigned this classification at this leaf / number of those that are incorrectly classified). A leaf with a single number, such as (31.0), has no incorrect classifications.

For the first NCA leaf in your example, it says that 461 instances were classified as NCA and that 343 of those 461 were incorrect classifications. So there are 461 - 343 = 118 correctly classified instances at that leaf.
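To make that reading concrete, here is a minimal sketch that unpacks a leaf annotation from the printed tree. The function name `parse_leaf_counts` and the regular expression are my own, not part of Weka, and the sketch assumes the leaf text looks like the lines in your output.

```python
import re

# Matches "(461.0/343.0)" as well as "(31.0)", where the error count is omitted.
LEAF_COUNTS = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)")


def parse_leaf_counts(text):
    """Return (total, incorrect, correct) for one leaf line of the printed tree."""
    match = LEAF_COUNTS.search(text)
    if match is None:
        raise ValueError(f"no leaf counts found in {text!r}")
    total = float(match.group(1))
    # A missing second number means no incorrect classifications at this leaf.
    incorrect = float(match.group(2)) if match.group(2) else 0.0
    return total, incorrect, total - incorrect


print(parse_leaf_counts("Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)"))
# (461.0, 343.0, 118.0)
print(parse_leaf_counts("Keyword_52_libya_Frequency > 0: NCA (31.0)"))
# (31.0, 0.0, 31.0)
```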

Looking through your decision tree, note that NCA also appears at other leaves. I count 118 + 3 + 31 + 4 = 156 correctly classified instances out of 461 + 3 + 31 + 4 = 499 total classifications of NCA.
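The same tally can be checked mechanically. This is only a sketch: the `LEAVES` list is transcribed by hand from the tree in the question, and `tally_by_class` is a helper name I made up for illustration.

```python
from collections import defaultdict

# (predicted class, instances at leaf, incorrect at leaf), copied from the tree
# in the question, top to bottom.
LEAVES = [
    ("NCA", 461, 343),
    ("BFE", 8, 3),
    ("S", 4, 1),
    ("NCA", 3, 0),
    ("F", 7, 0),
    ("BFE", 10, 1),
    ("NCA", 31, 0),
    ("S", 32, 1),
    ("NCA", 4, 0),
    ("SNT", 10, 0),
]


def tally_by_class(leaves):
    """Sum (correct, total) per predicted class across all leaves."""
    totals = defaultdict(lambda: [0, 0])
    for label, total, incorrect in leaves:
        totals[label][0] += total - incorrect
        totals[label][1] += total
    return {label: tuple(counts) for label, counts in totals.items()}


tally = tally_by_class(LEAVES)
print(tally["NCA"])                          # (156, 499)
print(sum(total for _, total, _ in LEAVES))  # 570, every record reaches a leaf
```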

Your confusion matrix shows 153 correct classifications of NCA out of 39 + 60 + 72 + 69 + 153 + 90 + 19 = 502 total classifications of NCA (the sum of column e).
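For comparison, here is a small sketch that reads the same figures off the confusion matrix. The matrix is typed in from the question; `predicted_as` is an illustrative helper, not a Weka function.

```python
LABELS = ["BFE", "FCL", "F", "M", "NCA", "SNT", "S"]

# Rows are the actual classes, columns are the predicted classes (a..g).
MATRIX = [
    [11, 0, 0, 0, 39, 0, 0],
    [0, 0, 0, 0, 60, 0, 0],
    [1, 0, 5, 0, 72, 0, 2],
    [0, 0, 1, 0, 69, 0, 0],
    [3, 0, 0, 0, 153, 0, 4],
    [0, 0, 0, 0, 90, 10, 0],
    [0, 0, 0, 0, 19, 0, 31],
]


def predicted_as(label):
    """Return (correct, total) for everything the classifier labelled `label`."""
    col = LABELS.index(label)
    correct = MATRIX[col][col]               # diagonal entry
    total = sum(row[col] for row in MATRIX)  # column sum
    return correct, total


print(predicted_as("NCA"))  # (153, 502)

# The diagonal sum matches the "Correctly Classified Instances" line.
correct_overall = sum(MATRIX[i][i] for i in range(len(LABELS)))
print(correct_overall, sum(map(sum, MATRIX)))  # 210 570
```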

So there is a slight difference between the tree (156/499) and your confusion matrix (153/502).

Note that if you are running Weka from the command line, it outputs a tree and a confusion matrix for testing on all the training data and also for testing with cross-validation, so you end up with two pairs of matrix and tree. Be careful that you are comparing the right matrix with the right tree; you may have mixed them up.