
I'm trying to understand how to make a confusion matrix after I use the glm function for a logistic regression. Here is my code so far. I am using the caret package and the confusionMatrix function.

dput(head(wine_quality))

structure(list(fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2, 8.1), 
    volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28), 
    citric.acid = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4), residual.sugar = c(20.7, 
    1.6, 6.9, 8.5, 8.5, 6.9), chlorides = c(0.045, 0.049, 0.05, 
    0.058, 0.058, 0.05), free.sulfur.dioxide = c(45, 14, 30, 
    47, 47, 30), total.sulfur.dioxide = c(170, 132, 97, 186, 
    186, 97), density = c(1.001, 0.994, 0.9951, 0.9956, 0.9956, 
    0.9951), pH = c(3, 3.3, 3.26, 3.19, 3.19, 3.26), sulphates = c(0.45, 
    0.49, 0.44, 0.4, 0.4, 0.44), alcohol = c(8.8, 9.5, 10.1, 
    9.9, 9.9, 10.1), quality = structure(c(4L, 4L, 4L, 4L, 4L, 
    4L), .Label = c("3", "4", "5", "6", "7", "8", "9", "white"
    ), class = "factor"), type = structure(c(3L, 3L, 3L, 3L, 
    3L, 3L), .Label = c("", "red", "white"), class = "factor"), 
    numeric_type = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L
), class = "data.frame")

library(tibble) 
library(broom) 
library(ggplot2)
library(caret)

any(is.na(wine_quality)) # this evaluates to FALSE


wine_model <- glm(type ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +  chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, wine_quality, family = "binomial")


# split data into test and train

smp_size <- floor(0.75 * nrow(wine_quality))

set.seed(123)
train_ind <- sample(seq_len(nrow(wine_quality)), size = smp_size)

train <- wine_quality[train_ind, ]
test <- wine_quality[-train_ind, ]


# make prediction on train data

pred <- predict(wine_model)

train$fixed.acidity <- as.numeric(train$fixed.acidity)
round(train$fixed.acidity)
train$fixed.acidity <- as.factor(train$fixed.acidity)

pred <- as.numeric(pred)
round(pred)
pred <- as.factor(pred)

confusionMatrix(pred, wine_quality$fixed.acidity)

After this final line of code, I get this error:

Error: `data` and `reference` should be factors with the same levels.

This error doesn't make sense to me. I've checked that pred and fixed.acidity have the same length (6497) and that both are of factor type.

length(pred)
length(wine_quality$fixed.acidity)

class(pred)
class(train$fixed.acidity)

Is there any obvious reason why this confusion matrix is not working? I'm trying to find a hit ratio for the model. I would appreciate a simple explanation; I really don't know what I'm doing here.


1 Answer


The error from confusionMatrix() tells us that the two variables passed to the function need to be factors with the same levels. We can see why we received the error when we run str() on both variables.

> str(pred)
 Factor w/ 5318 levels "-23.6495182533792",..: 310 339 419 1105 310 353 1062 942 594 1272 ...
> str(wine_quality$fixed.acidity)
 num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...

pred is a factor, while wine_quality$fixed.acidity is a numeric vector. The confusionMatrix() function is used to compare predicted and actual values of a dependent variable. It is not intended to cross-tabulate a predicted variable and an independent variable.

Code in the question uses fixed.acidity in the confusion matrix when it should be comparing predicted values of type against actual values of type from the testing data.

Also, the code in the question creates the model prior to splitting the data into test and training data. The correct procedure is to split the data before building a model on the training data, make predictions with the testing (hold back) data, and compare actuals to predictions in the testing data.

Finally, predict() as coded in the original post returns the linear predictors from the GLM model (the log-odds, equivalent to wine_model$linear.predictors in the fitted model object). These values must be converted to probabilities with predict(..., type = "response") and then thresholded into class labels before they are suitable for confusionMatrix().
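To illustrate that transformation, here is a minimal sketch using the built-in mtcars data as a stand-in (the binary outcome am and predictor mpg are assumptions for the example, not part of the wine data):

```r
# Fit a logistic regression on a built-in dataset.
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")

# type = "response" returns probabilities on the 0-1 scale,
# not the raw log-odds that predict() gives by default.
probs <- predict(fit, type = "response")

# Threshold at 0.5 and build factors with identical levels.
pred_class <- factor(ifelse(probs > 0.5, 1, 0), levels = c(0, 1))
actual     <- factor(mtcars$am, levels = c(0, 1))

# Both are now factors with the same levels, so
# caret::confusionMatrix(pred_class, actual) will accept them;
# base table() shows the same cross-tabulation:
table(Prediction = pred_class, Reference = actual)
```

The key point is that the predicted and reference vectors must end up as factors sharing one set of levels before confusionMatrix() is called.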

In practice, it's easier to use caret::train() with the GLM method and binomial family, where predict() will generate results that are usable in confusionMatrix(). We'll illustrate this with the UCI wine quality data.

First, we download the data from the UCI Machine Learning Repository to make the example reproducible.

if (!dir.exists("./data")) dir.create("./data")

download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
              "./data/wine_quality_red.csv")
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
              "./data/wine_quality_white.csv")

Second, we load the data, assign type as either red or white depending on the data file, and bind the data into a single data frame.

red <- read.csv("./data/wine_quality_red.csv", header = TRUE, sep = ";")
white <- read.csv("./data/wine_quality_white.csv", header = TRUE, sep = ";")
red$type <- "red"
white$type <- "white"
wine_quality <- rbind(red, white)
wine_quality$type <- factor(wine_quality$type)

Next, we split the data into training and testing sets based on the values of type, so that each data frame gets a proportional number of red and white wines. We then train the model with the default caret::train() settings and the GLM method.

library(caret)
set.seed(123)
inTrain <- createDataPartition(wine_quality$type, p = 3/4)[[1]]
training <- wine_quality[inTrain, ]
testing <- wine_quality[-inTrain, ]

aModel <- train(type ~ ., data = training, method = "glm", family = "binomial")

Finally, we use the model to make predictions on the hold back data frame, and run a confusion matrix.

testLM <- predict(aModel, testing)
confusionMatrix(data = testLM, reference = testing$type)

...and the output:

> confusionMatrix(data=testLM,reference=testing$type)
Confusion Matrix and Statistics

          Reference
Prediction  red white
     red    393     3
     white    6  1221
                                          
               Accuracy : 0.9945          
                 95% CI : (0.9895, 0.9975)
    No Information Rate : 0.7542          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.985           
                                          
 Mcnemar's Test P-Value : 0.505           
                                          
            Sensitivity : 0.9850          
            Specificity : 0.9975          
         Pos Pred Value : 0.9924          
         Neg Pred Value : 0.9951          
             Prevalence : 0.2458          
         Detection Rate : 0.2421          
   Detection Prevalence : 0.2440          
      Balanced Accuracy : 0.9913          
                                          
       'Positive' Class : red
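
The "hit ratio" the question asks about is simply the Accuracy figure reported above. With the testing data from the code above it could also be computed directly as mean(testLM == testing$type). A self-contained illustration with toy vectors (these example values are not from the wine data):

```r
# Hit ratio = proportion of predictions that match the actual class.
pred   <- factor(c("red", "white", "white", "red"))
actual <- factor(c("red", "white", "red",   "red"))

# Element-wise comparison works because both factors share one
# level set; mean() of the logical vector gives the hit ratio.
mean(pred == actual)  # 3 of 4 correct -> 0.75
```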