In my dataset I have a binary Target variable (0 or 1) and 8 features: nchar, rtc, Tmean, week_day, hour, ntags, nlinks and nex. week_day is a factor; the others are numeric. I'm trying to build a decision tree classifier:

library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[-c(2,4)] = scale(training_set[-c(2,4)])
test_set[-c(2,4)] = scale(test_set[-c(2,4)])

# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Target ~ .,
                   data = training_set)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-2], type = 'class')

# Making the Confusion Matrix
cm = table(test_set[, 2], y_pred)

plot(classifier, uniform=TRUE,margin=0.2)

The result of the plot is the following:

[plot of the fitted tree]

I have three questions I don't know the answers to:

  1. why in the plot some variables are missing? (E.g. rtc)
  2. what does aefg in week_day mean?
  3. is there a way to describe the different classes (0 vs 1 for the Target variable)? For example: in Target=1 we have all the rows that have nchar>0.19 and ntags>1.9, etc.

1 Answer


Here is an explanation with some data that you can find in the rpart package:

library(rpart)   # for decision tree
library(rattle)  # to do a nicer plot

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit     <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                  data = stagec,
                  method = 'class')

Question 1: why are some variables missing?
Because those variables are not useful for your model or, put better, you told the model not to keep splits below the complexity parameter cp (default = 0.01).
Looking at the doc for the cp parameter:

(...) Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it.

The doc explains it better technically than I can; in simple words, the cp parameter sets the baseline "usefulness" of a split.
If a variable never produces a split that clears that baseline, it is cut out, so the useless variables (read: variables adding no further information to the model) don't appear in the tree. Try changing the parameter in your model and you'll see how the tree changes. In my case, the eet variable is left out.
Running this:

 summary(cfit)
Call:
rpart(formula = progstat ~ age + eet + g2 + grade + gleason + 
    ploidy, data = stagec, method = "class")
  n= 146 

          CP nsplit rel error    xerror      xstd
1 0.10493827      0 1.0000000 1.0000000 0.1080241
2 0.05555556      3 0.6851852 1.0555556 0.1091597
3 0.02777778      4 0.6296296 0.9629630 0.1071508
4 0.01851852      6 0.5740741 0.9629630 0.1071508
5 0.01000000      7 0.5555556 0.9814815 0.1075992

Variable importance
     g2   grade gleason  ploidy     age     eet 
     30      28      20      13       7       2 

(... it continues...)

You can see that eet is the least important.
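A minimal sketch of that cp effect, reusing the same stagec model as above (the names cfit_default and cfit_strict are mine): raising cp prunes the weak splits, so the tree gets smaller and the least useful variables drop out.

```r
library(rpart)  # also provides the stagec dataset

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))

# Default complexity parameter (cp = 0.01)
cfit_default <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                      data = stagec, method = "class")

# Stricter threshold: a split must improve the fit by at least cp = 0.05
cfit_strict <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                     data = stagec, method = "class",
                     control = rpart.control(cp = 0.05))

# Compare which variables actually appear as splits in each tree
print(cfit_default$frame$var)
print(cfit_strict$frame$var)
```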

Question 2: what does aefg in week_day mean?
It means the split sends some of the week_day levels to the left and some to the right: week_day is a categorical variable, and in the default plot each letter stands for one level of the factor, in order (a = first level, b = second, and so on). So aefg means that observations whose week_day is the 1st, 5th, 6th or 7th level go down that branch.
Try this instead of the classic plot:

fancyRpartPlot(cfit, caption = NULL)

[fancyRpartPlot output for cfit]

You can see that diploid and tetraploid are sent to the left and the others to the right. From here:

(...)The tree is arranged so that the “more severe” nodes go to the right
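If you want the actual level names instead of letters, a simple option (shown here on the stagec model, where ploidy plays the role of your week_day) is to print the fitted tree: the text output spells out categorical splits with full level names, e.g. "ploidy=diploid,tetraploid", instead of one letter per level.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Unlike plot()/text(), print() writes each categorical split
# with the full names of the levels sent down that branch.
print(cfit)
```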

Question 3: is there a way to describe the different classes (0 vs 1 for the Target variable)?
In this case the variable is progstat, but you can carry the explanation over to your variable.
This is how I generally read those results in the plot:

Looking at the first node (the most important): it tells us that 63% are "No" and 37% "Prog" (read: yes). That node covers 100% of the population.

The next split is on grade < 2.5: if the condition holds you go to node 2, otherwise to node 3.

Going left, you cover 42% of the population. That node is labelled No, but only 85% of its observations really are No; the rest are Prog cases labelled No.

TL;DR
This means: "The overall population is divided into No and Prog, at 63% and 37%.
If grade < 2.5, the model says that in our data 42% of the population has that value of grade, and within that 42%, 85% have outcome No. Probably grade and the outcome No of the dependent variable are linked."
In this way you can check all the nodes in your plot, and also use summary() to see the most important patterns.

In your plot, you can say that "if ntags > 1.952 and nchar < 0.1449, then I have a 0".
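If you'd rather have those class descriptions generated automatically than read them off the plot, one option is path.rpart() from rpart itself, which returns the chain of split conditions leading to any node. A sketch on the stagec model (the leaves and rules names are mine):

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Node numbers of the terminal nodes (leaves)
leaves <- as.numeric(rownames(cfit$frame))[cfit$frame$var == "<leaf>"]

# For each leaf, the full path of conditions (e.g. "grade< 2.5")
# from the root down to that leaf
rules <- path.rpart(cfit, nodes = leaves, print.it = FALSE)
str(rules)
```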