Interpreting how R codifies dummy response variable in Logistic Regression

Question

I am a newbie having trouble interpreting the output of my logistic regression. My response variable has two values: “multiplex” and “subterraneus”. When I use the factor() function on the Group column of the “microtus.train” data frame, I get the levels “multiplex” and “subterraneus”, in that order. After fitting the model and predicting the response, I have trouble understanding what the probabilities mean. Are they the probabilities of an observation being “subterraneus”? When I ran contrasts(microtus.train$Group), I got the table below.

> contrasts(microtus.train$Group)
             subterraneus
multiplex               0
subterraneus            1

Based on this table, I interpret that the model is trying to predict probabilities of “subterraneus” (not the probabilities of “multiplex”) because “1” is dummy coded for “subterraneus”. Is my assumption correct?

My code is given below and I appreciate your help in advance.

library(Flury)
data(microtus, package = "Flury")

str(microtus)
summary(microtus)

# Creating training & test data frames
microtus.train <- subset(microtus, 
                     microtus$Group %in% c("multiplex", "subterraneus"), 
                     select = c("Group", "M1Left", "M2Left", "M3Left", 
                                "Foramen", "Pbone","Length", "Height",
                                "Rostrum") )

# Drop the unused third factor level
microtus.train$Group <- droplevels(microtus.train$Group)
factor(microtus.train$Group)  # prints the values and the level order

nullModel.GLM <- glm(Group ~ 1, data = microtus.train, 
                     family = binomial())
fullModel.GLM <- glm(Group ~ ., data = microtus.train, 
                     family = binomial())
summary(nullModel.GLM)
summary(fullModel.GLM)

stepFwd.GLM <- step(nullModel.GLM, scope = list(upper = fullModel.GLM), 
                    direction = 'forward', k = 2)
stepFwd.GLM.fitResults <- predict(stepFwd.GLM, type = 'response')
stepFwd.GLM.fitResults

contrasts(microtus.train$Group)

Ben Bolker · Accepted Answer · 2017-12-10T01:43:11

It's not the contrasts that matter, but the order of the factor levels (contrasts specify how the predictor variables are encoded as dummy variables). From ?glm:

For ‘binomial’ and ‘quasibinomial’ families the response can also be specified as a ‘factor’ (when the first level denotes failure and all others success)

Since R defines the levels of factors in alphabetical order by default, "multiplex" is (probably) the first level and "subterraneus" is the second, hence the logistic regression is predicting the probability of "subterraneus". You can check this with levels(microtus.train$Group), and change it if necessary by calling factor() with the levels argument set explicitly (or by using relevel()).
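As a rough illustration of the point about level order, here is a minimal sketch on simulated data. The data frame `toy` and its predictor `size` are hypothetical stand-ins for microtus.train and its measurements, not the real Flury data. It checks the level order, fits the model, then reverses the levels and confirms that the two sets of fitted probabilities are complementary.

```r
# Simulated stand-in for microtus.train; `toy` and `size` are hypothetical names
set.seed(1)
toy <- data.frame(
  Group = factor(rep(c("multiplex", "subterraneus"), each = 20)),
  size  = c(rnorm(20, mean = 0), rnorm(20, mean = 1))
)

# The first level is treated as "failure" (0), every other level as "success" (1)
levels(toy$Group)

# With the default (alphabetical) order, fitted values are P(Group == "subterraneus")
fit <- glm(Group ~ size, data = toy, family = binomial())
p_subterraneus <- predict(fit, type = "response")

# Put "multiplex" last so that it becomes the modelled ("success") class
toy$Group2 <- factor(toy$Group, levels = c("subterraneus", "multiplex"))
fit2 <- glm(Group2 ~ size, data = toy, family = binomial())
p_multiplex <- predict(fit2, type = "response")

# The two fits describe the same model from opposite sides: probabilities sum to 1
all.equal(unname(p_subterraneus + p_multiplex), rep(1, nrow(toy)))
```

Reversing the levels only flips the sign of the coefficients; the fit itself is unchanged, so which level comes last is purely a matter of which class you want the reported probability to refer to.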
