Decision tree prediction in R uses the id to influence the prediction

Question

I'm using the C50 library to predict attendance at next year's graduation, but my tree uses the ID as part of the prediction. When I take the ID out, the tree collapses to a single node (one level). Any suggestions would be highly appreciated.

part of the Dataset (JSON):

{"id":"50","name":"James Charlie","faculty":"Science","degree":"Bachelor degree","course":"Sport Science","attend":"No","year":"2016"},

full dataset/Student object: git repo

R script:

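library(RMySQL)   # assumed source of MySQL(), dbConnect(), dbReadTable()
library(C50)      # C5.0()
library(gmodels)  # assumed source of CrossTable()

con <- dbConnect(MySQL(), user = 'root', password = '', dbname = 'students', host = 'localhost')
dbListTables(con)
Student <- dbReadTable(con, 'students')
rows <- nrow(Student)

# the tree needs categorical columns as factors
Student$attend <- as.factor(Student$attend)
Student$year <- as.factor(Student$year)
Student$faculty <- as.factor(Student$faculty)
Student$course <- as.factor(Student$course)
Student
dim(Student)
summary(Student)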

str(Student)
Student <- Student[-2] # drop column 2 (name)
dim(Student)

str(Student)

set.seed(1234)
Student_rand <- Student[order(runif(719)), ] # shuffle the rows
Student_train <- Student_rand[1:400, ] # training set used to fit the model
Student_test  <- Student_rand[401:719, ] # held-out set used to check the predictions
summary(Student_train)


prop.table(table(Student_train$attend)) # class proportions in the training set
prop.table(table(Student_test$attend))

Student_model <- C5.0(Student_train[,-5],Student_train$attend)

summary(Student_model)

Student_model
summary(Student_model)
jpeg("tree.jpg")
plot(Student_model)
dev.off()

Student_model$predictors
Student_model$trials
Student_model$tree
summary(Student_model)

Student_pred <- predict(Student_model, Student_test,type="class")

table(Student_test$attend ,Student_pred)
CrossTable(Student_pred, Student_test$attend,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('predicted default', 'actual default'))

and finally the tree:

The first thing I tried was removing the id, and I got the following error:

Error in partysplit(varid = as.integer(i), index = index, info = k, prob = NULL) :
  minimum of ‘index’ is not equal to 1
In addition: Warning message:
In min(index, na.rm = TRUE) : no non-missing arguments to min; returning Inf

Then I tried adding a random column, which caused the tree to split on that random column instead.

Try making the ID the row names instead of a column of its own — camille
If you need further help, make an example data set available (use dput on Student or head(Student, 30) if you have many observations). The 4 lines of json are insufficient to allow us to replicate your problem. — lmo
hello there, I added it as a sample so people will know what is the dataset i'm working with.. i will attach it in the question :) — jill
Your question implies you have an R object named Student. Instead of posting SQL/JSON code in your Github version of the data, why not use dput or dump to create an ASCII version of the actual R structure. Or you could post the R code that was used to read from that URL. — IRTFM
@42- i attach the object in the git directory/ and the full R code — jill
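
camille's suggestion above could be sketched roughly as follows. The data frame is a made-up three-row stand-in, not the real Student table: only the first row's id, faculty and attend values come from the JSON sample in the question, and the other two rows are hypothetical. Moving id into the row names keeps it available for lookups while removing it from the columns a tree learner would treat as predictors.

```r
# Toy stand-in for Student; rows 2 and 3 are hypothetical.
toy <- data.frame(
  id      = c("50", "51", "52"),
  faculty = factor(c("Science", "Arts", "Science")),
  attend  = factor(c("No", "Yes", "Yes")),
  stringsAsFactors = FALSE
)

# Keep the identifier as row names, then drop the column itself.
rownames(toy) <- toy$id
toy$id <- NULL

names(toy)     # only faculty and attend remain as columns
rownames(toy)  # the ids are still attached to the rows
```

This only changes where the id lives. Per the question, a tree fitted without id collapsed to a single node, so the remaining columns may still need an extra predictor.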

IRTFM · Accepted Answer · 2018-04-22T19:50:43

I copied the structure and assigned it to Student. The original data frame, which retains the association of name and id, showed that there were many repeats of the name variable:

str(Student[2])
#'data.frame':  724 obs. of  1 variable:
# $ name: chr  "Jill Austin" "David Beckham" "Chris Evans" "Sheldon Cooper" ...
length(table(Student[2]))
#[1] 201

Then I looked at how often names repeat among the first 164 rows. Repeats were very rare when the id was less than 165:

length(table(Student[1:164, 2]))
#[1] 163

So define a variable that flags repeats:

 Student$IsRepeated <- ave( Student$name, Student$name, FUN=length) > 1
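
To see what such a flag does, here is a minimal sketch on a made-up name vector. The names are borrowed from the str() output, but the repeat pattern is hypothetical. As a variation of my own, it runs ave() over a numeric index rather than over the character column, so the per-group counts stay numeric instead of being coerced to character.

```r
# Hypothetical names; the repeats are invented for illustration.
nm <- c("Jill Austin", "David Beckham", "Jill Austin",
        "Chris Evans", "David Beckham", "Jill Austin")

# For every row, count how many rows share its name.
n_same <- ave(seq_along(nm), nm, FUN = length)

# TRUE when the name occurs more than once in the whole vector.
is_repeated <- n_same > 1

data.frame(nm, n_same, is_repeated)
```

Only "Chris Evans" occurs once in this toy vector, so it is the only row flagged FALSE.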

It then turns out that "name repeatingness" is associated with attendance once the other predictors are taken into account. On its own, the two-way table shows little:

> with( Student, table( attend, IsRepeated ) )
      IsRepeated
attend FALSE TRUE
   No     50  259
   Yes    59  356   # so nothing dramatic here, but try other predictors as well
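
The counts above are easier to compare as proportions. A minimal sketch that re-enters the four counts by hand (so it runs without the Student data) and computes the attendance rate within each IsRepeated group:

```r
# Counts copied from the table above; matrix() fills column by column.
tab <- matrix(c(50, 59, 259, 356), nrow = 2,
              dimnames = list(attend = c("No", "Yes"),
                              IsRepeated = c("FALSE", "TRUE")))

# Column-wise proportions: share of No/Yes within each IsRepeated group.
rates <- prop.table(tab, margin = 2)

round(rates["Yes", ], 3)  # about 0.541 for FALSE and 0.579 for TRUE
```

The two attendance rates differ by less than four percentage points, which matches the "nothing dramatic" remark.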

First I looked at the output of:

 with( Student, table(attend, year, IsRepeated , faculty) )

That output is long. In it I noticed that the Science and Engineering faculties were somewhat different from the rest, so I collapsed faculty into a single flag:

  with( Student, table(attend, year, IsRepeated , fac_EorS=faculty %in% c("Engineering", "Science") ) )
, , IsRepeated = FALSE, fac_EorS = FALSE

      year
attend 2015 2016 2017 2018
   No     0    0    0   10
   Yes    0    0    0   16

, , IsRepeated = TRUE, fac_EorS = FALSE

      year
attend 2015 2016 2017 2018
   No     9    9    9  131
   Yes   37   17   17  113

, , IsRepeated = FALSE, fac_EorS = TRUE

      year
attend 2015 2016 2017 2018
   No     0    0    1   39
   Yes    1    0    0   42

, , IsRepeated = TRUE, fac_EorS = TRUE

      year
attend 2015 2016 2017 2018
   No    34   34   33    0   # also shows how `year` became the 2nd split
   Yes   45   32   32   63
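
One way to read the four-way table is to convert a stratum into an attendance rate per year. The sketch below re-enters the counts of the last stratum (IsRepeated = TRUE, fac_EorS = TRUE) by hand. With the real data, the same numbers would come from the table() call above.

```r
# Counts copied from the IsRepeated = TRUE, fac_EorS = TRUE stratum.
years <- c("2015", "2016", "2017", "2018")
no    <- c(34, 34, 33, 0)
yes   <- c(45, 32, 32, 63)

# Share of students who attended, per year.
rate <- setNames(yes / (yes + no), years)

round(rate, 3)  # roughly 0.570, 0.485, 0.492 and exactly 1 for 2018
```

In this stratum every 2018 student attended, while the earlier years sit near one half. That is consistent with the remark that year ends up as the second split.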
