Logistic regression confusion matrix problem

Question

I tried computing a confusion matrix for my glm model, but I keep getting:

Error: data and reference should be factors with the same levels.

Below is my model:

model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)
confusionMatrix(table(predict(model3, newdata=test_set, type="response")) >= 0.5,
                      train_set$winner == 1)

The winner variable contains team1 and team2.
srs.1 and srs.2 are numeric.

What is my problem here?

First check that predict() on its own works as expected. If it does not, make sure that any factor predictors in test_set have the same levels as (or a subset of) the levels of the same variables in train_set. For example, if your training data has a variable gender with factor levels "male" and "female", you can't have a factor level of "other" in the testing data. - DanY
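The check suggested in this comment could look roughly like the sketch below. The toy data, the object name fit, and the gender vectors are made up for illustration; only the column names srs.1, srs.2 and winner come from the question.

```r
# Toy stand-ins for the asker's data (hypothetical values)
set.seed(1)
train_set <- data.frame(srs.1 = rnorm(30), srs.2 = rnorm(30),
                        winner = rbinom(30, 1, 0.5))
test_set <- data.frame(srs.1 = rnorm(10), srs.2 = rnorm(10))

fit <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)

# 1. Does predict() alone return one probability per test row?
prob <- predict(fit, newdata = test_set, type = "response")
stopifnot(length(prob) == nrow(test_set), all(prob >= 0 & prob <= 1))

# 2. For factor predictors, look for levels that appear only in the test data
train_gender <- factor(c("male", "female", "male"))
test_gender <- factor(c("female", "other"))
unseen <- setdiff(levels(test_gender), levels(train_gender))
print(unseen)  # levels the model never saw during training
```

With numeric predictors such as srs.1 and srs.2, only the first check applies; the second matters when a predictor is a factor.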

StupidWolf · Accepted Answer · 2020-03-27T18:46:24

I assume your winner label is binary, coded as 0/1. So let's use the example below:

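library(caret)
set.seed(111)

# simulate two numeric predictors
data = data.frame(
  srs.1 = rnorm(200),
  srs.2 = rnorm(200)
)

# binary label: 1 when both predictors have the same sign, else 0
data$winner = ifelse(data$srs.1*data$srs.2 > 0,1,0)

# 150 rows for training, the remaining 50 for testing
idx = sample(nrow(data),150)
train_set = data[idx,]
test_set = data[-idx,]

model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)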

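As you did, we predict 1 when the fitted probability is above 0.5 and 0 otherwise. Your use of table() was almost right, but two things went wrong in your call: table() is closed around the predictions alone, before the comparison with 0.5, and the predictions come from test_set while the reference is train_set$winner. The predictions and the reference must come from the same data set, either both from test_set or both from train_set: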

pred = as.numeric(predict(model3, newdata=test_set, type="response")>0.5)
ref = test_set$winner

confusionMatrix(table(pred,ref))

Confusion Matrix and Statistics

    ref
pred  0  1
   0 12  5
   1 19 14

               Accuracy : 0.52            
                 95% CI : (0.3742, 0.6634)
    No Information Rate : 0.62            
    P-Value [Acc > NIR] : 0.943973        

                  Kappa : 0.1085
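
If winner really holds the labels team1 and team2 rather than 0/1, a sketch along the following lines may help. It reuses the simulated data from above, and the mapping of the same-sign case to "team2" is an assumption made only for illustration. When the response is a factor, glm() with family = binomial models the probability of the second level, so the thresholded predictions are mapped back onto the same levels before calling confusionMatrix() with its data and reference arguments.

```r
library(caret)

set.seed(111)
data <- data.frame(srs.1 = rnorm(200), srs.2 = rnorm(200))
# Assumed labelling for illustration: same sign -> "team2", otherwise "team1"
data$winner <- factor(ifelse(data$srs.1 * data$srs.2 > 0, "team2", "team1"),
                      levels = c("team1", "team2"))

idx <- sample(nrow(data), 150)
train_set <- data[idx, ]
test_set <- data[-idx, ]

model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)

# type = "response" gives P(winner == second level), here "team2"
prob <- predict(model3, newdata = test_set, type = "response")

# Build prediction and reference as factors sharing the same levels
lv <- levels(train_set$winner)
pred <- factor(ifelse(prob > 0.5, lv[2], lv[1]), levels = lv)
ref <- factor(test_set$winner, levels = lv)

cm <- confusionMatrix(data = pred, reference = ref)
print(cm$table)
```

Setting levels = lv on both factors is what avoids the "data and reference should be factors with the same levels" error, even when one class never appears among the predictions.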
