
I am trying to perform logistic regression on the dataset provided here, using 5-fold cross-validation.

My goal is to make predictions for the Classification column of the dataset, which takes the value 1 (no cancer) or 2 (cancer).

Here is the full code:

     library(ISLR)
     library(boot)
     dataCancer <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")

     #Randomly shuffle the data
     dataCancer<-dataCancer[sample(nrow(dataCancer)),]
     #Create 5 equal-sized folds
     folds <- cut(seq(1,nrow(dataCancer)),breaks=5,labels=FALSE)
     #Perform 5 fold cross validation
     for(i in 1:5){
           #Segment your data by fold using the which() function 
           testIndexes <- which(folds == i)
           testData <- dataCancer[testIndexes, ]
           trainData <- dataCancer[-testIndexes, ]
           #Use the test and train data partitions however you desire...

           classification_model = glm(as.factor(Classification) ~ ., data = trainData,family = binomial)
           summary(classification_model)

           #Use the fitted model to do predictions for the test data
           model_pred_probs = predict(classification_model , testData , type = "response")
           model_predict_classification = rep(0 , length(testData))
           model_predict_classification[model_pred_probs > 0.5] = 1

           #Create the confusion matrix and compute the misclassification rate
           table(model_predict_classification , testData)
           mean(model_predict_classification != testData)
     }

I would like some help with the last two lines:

 table(model_predict_classification , testData)
 mean(model_predict_classification != testData)

I get the following error:

 Error in table(model_predict_classification, testData) : all arguments must have the same length

I don't understand very well how to use the confusion matrix.

I want to get 5 misclassification rates, one per fold. The trainData and testData have been cut into 5 segments, so each test segment should have the same length as model_predict_classification.

Thanks for your help.


1 Answer


Here is a solution using the caret package to perform 5-fold cross-validation on the cancer data after splitting it into test and training sets. Confusion matrices are generated against both the test and training data.

caret::train() reports the average accuracy across the 5 hold-out folds. The results for each individual fold can be obtained by extracting them from the output model object.
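As a sketch of that extraction (assuming `glmModel` is the `caret::train()` object fitted in the code below), the per-fold hold-out results are stored in the `resample` component of the returned object, so the five misclassification rates the question asks for fall out directly:

```r
# `glmModel$resample` is a data frame with one row per hold-out fold,
# with columns Accuracy, Kappa and Resample (the fold label)
glmModel$resample

# misclassification rate per fold = 1 - per-fold accuracy
foldErrors <- 1 - glmModel$resample$Accuracy
foldErrors        # five values, one per fold
mean(foldErrors)  # matches 1 minus the average CV accuracy reported by train()
```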

    library(caret)
    data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")
    # set Classification as a factor, and recode to
    # 0 = no cancer, 1 = cancer
    data$Classification <- as.factor(data$Classification - 1)
    # split data into training and test sets, stratified on the dependent variable
    trainIndex <- createDataPartition(data$Classification, p = .75, list = FALSE)
    training <- data[trainIndex, ]
    testing <- data[-trainIndex, ]
    trCntl <- trainControl(method = "CV", number = 5)
    glmModel <- train(Classification ~ ., data = training, trControl = trCntl,
                      method = "glm", family = "binomial")
    # print the model info
    summary(glmModel)
    glmModel
    confusionMatrix(glmModel)
    # generate predictions on the held-back test data
    trainPredicted <- predict(glmModel, testing)
    # generate confusion matrix for the held-back test data
    confusionMatrix(trainPredicted, reference = testing$Classification)

...and the output:

    > # print the model info
    > summary(glmModel)

    Call: NULL

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -2.1542  -0.8358   0.2605   0.8260   2.1009  

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)  
    (Intercept) -4.4039248  3.9159157  -1.125   0.2607  
    Age         -0.0190241  0.0177119  -1.074   0.2828  
    BMI         -0.1257962  0.0749341  -1.679   0.0932 .
    Glucose      0.0912229  0.0389587   2.342   0.0192 *
    Insulin      0.0917095  0.2889870   0.317   0.7510  
    HOMA        -0.1820392  1.2139114  -0.150   0.8808  
    Leptin      -0.0207606  0.0195192  -1.064   0.2875  
    Adiponectin -0.0158448  0.0401506  -0.395   0.6931  
    Resistin     0.0419178  0.0255536   1.640   0.1009  
    MCP.1        0.0004672  0.0009093   0.514   0.6074  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 119.675  on 86  degrees of freedom
    Residual deviance:  89.804  on 77  degrees of freedom
    AIC: 109.8

    Number of Fisher Scoring iterations: 7

    > glmModel
    Generalized Linear Model 

    87 samples
     9 predictor
     2 classes: '0', '1' 

    No pre-processing
    Resampling: Cross-Validated (5 fold) 
    Summary of sample sizes: 70, 69, 70, 69, 70 
    Resampling results:

      Accuracy   Kappa    
      0.7143791  0.4356231

    > confusionMatrix(glmModel)
    Cross-Validated (5 fold) Confusion Matrix 

    (entries are percentual average cell counts across resamples)

              Reference
    Prediction    0    1
             0 33.3 17.2
             1 11.5 37.9

     Accuracy (average) : 0.7126

    > # generate predictions on hold back data
    > trainPredicted <- predict(glmModel, testing)
    > # generate confusion matrix for hold back data
    > confusionMatrix(trainPredicted, reference = testing$Classification)
    Confusion Matrix and Statistics

              Reference
    Prediction  0  1
             0 11  2
             1  2 14

                   Accuracy : 0.8621          
                     95% CI : (0.6834, 0.9611)
        No Information Rate : 0.5517          
        P-Value [Acc > NIR] : 0.0004078       

                      Kappa : 0.7212          
     Mcnemar's Test P-Value : 1.0000000       

                Sensitivity : 0.8462          
                Specificity : 0.8750          
             Pos Pred Value : 0.8462          
             Neg Pred Value : 0.8750          
                 Prevalence : 0.4483          
             Detection Rate : 0.3793          
       Detection Prevalence : 0.4483          
          Balanced Accuracy : 0.8606          

           'Positive' Class : 0               