Here is an explanation using some data that ships with the rpart package:
library(rpart) # for decision tree
library(rattle) # to do a nicer plot
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
data = stagec,
method ='class')
Question 1: why are some variables left out?
Because those variables are not useful for your model or, more precisely, you have told the model not to keep any split that improves the fit by less than the cp parameter (default = 0.01).
From the documentation of the cp parameter:
(...) Essentially, the user informs the program that any split which does not
improve the fit by cp will likely be pruned off by cross-validation,
and that hence the program need not pursue it.
I think the documentation explains it better technically than I can but, in simple words, the cp parameter sets the baseline of "utility" for a node.
If a split on a variable does not reach that baseline, it is cut out, so variables that add no further information to the model don't appear in the tree. Try changing the parameter in your model and you'll see how the tree changes. In my case, the eet variable is left out.
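To see the effect of cp on the same data, here is a minimal sketch that refits the tree with a looser and a stricter threshold and lists the variables each tree actually splits on. The helper name used_vars and the two alternative cp values (0.001 and 0.1) are only illustrative choices, not something from the question.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
form <- progstat ~ age + eet + g2 + grade + gleason + ploidy

# Hypothetical helper: variables used as primary splits in a fitted tree
used_vars <- function(fit) {
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

# Same model with three values of cp (0.01 is the default)
fit_default <- rpart(form, data = stagec, method = "class")
fit_loose   <- rpart(form, data = stagec, method = "class",
                     control = rpart.control(cp = 0.001))
fit_strict  <- rpart(form, data = stagec, method = "class",
                     control = rpart.control(cp = 0.1))

# A smaller cp keeps more splits, so more variables can show up
print(used_vars(fit_strict))
print(used_vars(fit_default))
print(used_vars(fit_loose))
```

The stricter the cp, the fewer nodes survive, so a variable can disappear from the tree without being removed from the formula.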
Running this:
summary(cfit)
Call:
rpart(formula = progstat ~ age + eet + g2 + grade + gleason +
ploidy, data = stagec, method = "class")
n= 146
          CP nsplit rel error    xerror      xstd
1 0.10493827      0 1.0000000 1.0000000 0.1080241
2 0.05555556      3 0.6851852 1.0555556 0.1091597
3 0.02777778      4 0.6296296 0.9629630 0.1071508
4 0.01851852      6 0.5740741 0.9629630 0.1071508
5 0.01000000      7 0.5555556 0.9814815 0.1075992
Variable importance
     g2   grade gleason  ploidy     age     eet
     30      28      20      13       7       2
(... it continues...)
You can see that eet is the least important.
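The importance figures can also be pulled out of the fitted object directly. The sketch below assumes the same model as above and rescales the raw scores to percentages, which is how I understand summary() arrives at its rounded figures. As far as I know, the scores also count surrogate splits, which would explain why a variable can have a small importance without appearing in the tree.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Raw importance scores, sorted from most to least important
imp <- cfit$variable.importance

# Rescale so the scores sum to roughly 100, as in the summary() printout
imp_pct <- round(100 * imp / sum(imp))
print(imp_pct)
```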
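Question 2: what does aefg in week_day mean?
It means that the split sends some levels of week_day to the left and the others to the right, so week_day must be a categorical variable. In the default plot each letter stands for one level of the factor (a is the first level, b the second, and so on), so aefg lists the levels that go to the left.
Try this instead of the classical plot: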
fancyRpartPlot(cfit, caption = NULL)
You can see that diploid and tetraploid are sent to the left, and the other level to the right. From here:
(...)The tree is arranged so that the “more severe” nodes go to the
right
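To decode the letters without a fancier plot, you can map them back to the factor levels yourself. This is a small sketch on the ploidy variable of the example data; the object name letter_key is just an illustrative choice.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Letter -> level lookup: "a" is the first level, "b" the second, ...
lv <- levels(stagec$ploidy)
letter_key <- setNames(lv, letters[seq_along(lv)])
print(letter_key)

# Split labels with single-letter codes (as in the default plot) ...
print(labels(cfit, minlength = 1))
# ... and with the full level names
print(labels(cfit, minlength = 0))
```

With your own data, replace stagec$ploidy with your week_day column to read off what a, e, f and g stand for.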
Question 3: is there a way to describe the different classes (0 vs 1 for the Target variable)?
In this case the variable is progstat, but you can carry the explanation over to your own variable.
This is how I generally read the results in the plot:
Looking at the first node (the root): it tells us that 63% are "No" and 37% are "Prog" (read: yes). That node covers 100% of the population.
The first split is on grade: if grade < 2.5 you go left to node 2; otherwise, you go right to node 3.
If you go left, you have 42% of the population. That group is labelled No, but only 85% of it is really No; the rest are Prog cases mislabelled as No.
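The same numbers can be read from the fitted object instead of the plot. The sketch below builds a small per-node table from the frame component of the tree; the table and its column names are my own, and the positions of the probability columns in yval2 are assumed for a two-class outcome.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

fr <- cfit$frame

# For a two-class tree, yval2 holds: fitted class, the two class counts,
# the two class probabilities, and the node probability
node_tab <- data.frame(
  node        = rownames(fr),
  split_var   = as.character(fr$var),
  pct_of_data = round(100 * fr$n / fr$n[1]),
  predicted   = levels(progstat)[fr$yval],
  pct_No      = round(100 * fr$yval2[, 4]),
  pct_Prog    = round(100 * fr$yval2[, 5])
)
print(node_tab)
```

Each row corresponds to one box in the plot: the share of the data in the node, the predicted label, and the class split inside it.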
TL;DR
This means: "The overall population is divided into No and Prog, at 63% and 37%.
If grade is < 2.5, the model says that 42% of our data has that value of grade, and within that 42%, 85% have the result No. Probably grade and the "No" outcome of the dependent variable are related".
In this way you can go through all the nodes in your plot, also using summary(), to see the most important patterns.
In your plot, you can say: "if ntags > 1.952 and nchar < 0.1449, then I have a 0".
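If you prefer to get such "if ... and ... then" rules as text rather than reading them off the plot, one possible approach is path.rpart(), which returns the chain of split conditions leading to each node. The sketch below uses the example tree; the names leaves and rules are illustrative.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Node numbers of the terminal nodes (leaves)
is_leaf <- cfit$frame$var == "<leaf>"
leaves  <- as.numeric(rownames(cfit$frame)[is_leaf])

# One character vector of split conditions per leaf, starting at the root
rules <- path.rpart(cfit, nodes = leaves, print.it = FALSE)

# Print each rule together with the class predicted in that leaf
for (node in names(rules)) {
  pred <- levels(progstat)[cfit$frame[node, "yval"]]
  cat("Node", node, ":", paste(rules[[node]][-1], collapse = " and "),
      "->", pred, "\n")
}
```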