I am trying to fit a logistic regression to the dataset provided here, using 5-fold cross-validation.
My goal is to predict the Classification column of the dataset, which takes the value 1 (no cancer) or 2 (cancer).
Here is the full code:
library(ISLR)
library(boot)
dataCancer <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")
#Randomly shuffle the data
dataCancer<-dataCancer[sample(nrow(dataCancer)),]
#Create 5 equally sized folds
folds <- cut(seq(1,nrow(dataCancer)),breaks=5,labels=FALSE)
#Perform 5 fold cross validation
for(i in 1:5){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds == i)
  testData <- dataCancer[testIndexes, ]
  trainData <- dataCancer[-testIndexes, ]
  #Use the test and train data partitions however you desire...
  classification_model = glm(as.factor(Classification) ~ ., data = trainData,family = binomial)
  summary(classification_model)
  #Use the fitted model to do predictions for the test data
  model_pred_probs = predict(classification_model , testData , type = "response")
  model_predict_classification = rep(0 , length(testData))
  model_predict_classification[model_pred_probs > 0.5] = 1
  #Create the confusion matrix and compute the misclassification rate
  table(model_predict_classification , testData)
  mean(model_predict_classification != testData)
}
I need help with the last two lines of the loop:
table(model_predict_classification , testData)
mean(model_predict_classification != testData)
I get the following error:
Error in table(model_predict_classification, testData) : all arguments must have the same length
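One possible reading of this error, sketched on a small made-up data frame (toy_test, toy_pred and all values below are placeholders, not the real dataset): length() on a data frame counts columns rather than rows, and table() is being given the whole data frame instead of the Classification column, so the two arguments cannot line up.

```r
# Toy stand-in for testData (hypothetical values, not the real dataset)
toy_test <- data.frame(Age = c(48, 83, 82, 68),
                       BMI = c(23.5, 20.7, 23.1, 21.4),
                       Classification = c(1, 1, 2, 2))

length(toy_test)                  # 3: number of columns
nrow(toy_test)                    # 4: number of rows
length(toy_test$Classification)   # 4: one entry per row

# A prediction vector sized by rows lines up with the response column
toy_pred <- rep(1, nrow(toy_test))
table(toy_pred, toy_test$Classification)
```

If this reading is right, sizing the prediction vector with nrow(testData) and comparing against testData$Classification should at least make the lengths agree.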
I don't fully understand how to build the confusion matrix.
I want to obtain 5 misclassification rates, one per fold. The data have been split into 5 folds, so I expected the size of testData to match that of model_predict_classification.
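A minimal sketch of how such a loop could collect one misclassification rate per fold, run on synthetic data so it works offline. The data frame toy, its columns x1 and x2, and the vector misclass_rates are assumptions made for illustration and are not part of the real dataset. It also assumes that, for a factor response with levels 1 and 2, the fitted probability refers to level 2, so predictions are coded 1/2 rather than 0/1.

```r
set.seed(1)

# Synthetic stand-in for dataCancer (assumption: numeric predictors and a
# Classification column coded 1/2, stored as a factor up front)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Classification <- factor(ifelse(toy$x1 + rnorm(n) > 0, 2, 1),
                             levels = c(1, 2))

# Shuffle the rows and assign each row to one of 5 folds
toy <- toy[sample(nrow(toy)), ]
folds <- cut(seq(1, nrow(toy)), breaks = 5, labels = FALSE)

misclass_rates <- numeric(5)
for (i in 1:5) {
  testIndexes <- which(folds == i)
  testData <- toy[testIndexes, ]
  trainData <- toy[-testIndexes, ]

  fit <- glm(Classification ~ ., data = trainData, family = binomial)

  # For a two-level factor response, this is the probability of level "2"
  probs <- predict(fit, newdata = testData, type = "response")

  # One prediction per test row, coded with the same levels as the response
  pred <- factor(ifelse(probs > 0.5, 2, 1), levels = c(1, 2))

  # Confusion matrix against the response column only
  print(table(predicted = pred, actual = testData$Classification))
  misclass_rates[i] <- mean(pred != testData$Classification)
}

misclass_rates   # five rates, one per fold
```

Converting Classification to a factor once, before the loop, is a deliberate choice here: as far as I understand, with as.factor(Classification) on the left-hand side the . on the right may still pick up the raw Classification column as a predictor, which would be worth checking in the original model.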
Thanks for your help.