
I'm using logistic regression to predict a binary outcome variable (Group, 0/1). I noticed something odd: I have two variables representing the same outcome. One is coded simply as "0" or "1":

> df$Group
> [1] 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
> [59] 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0
> [117] 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1
> [175] 1 0 1
> Levels: 0 1
> is.factor(df$Group)
> [1] TRUE

The other variable represents the same thing, but with name labels:

> df$Group2
> [1] CON CI CON CI CI CON CI CI CON CI CI CI CON CI
> [15] CI etc.
> Levels: CI CON
> is.factor(df$Group2)
> [1] TRUE
> contrasts(df$Group2)
> CI        0
> CON       1

In the first variable, 0 = CON and 1 = CI. I created that numerical variable because I wanted CI to be my "1" group and CON the "0" reference group, but every time I applied as.factor to the original data, CI became level 1 and CON level 2.

I thought the two variables were equivalent, but when I plotted the odds ratios with the sjPlot package as a sanity check, I noticed that the ORs were quite different. The coefficients in summary() of the glm models looked the same apart from the sign of the estimates, which makes sense because the two groups are coded in opposite directions. Specifically, the plotted ORs are clearly larger with the numerical variable and smaller with the "name" variable.

Am I missing something in my understanding of R (I'm self-taught) or in how logistic regression is computed? Which of the two variables should I use in the logistic regression? And how can I make R use CON, rather than CI, as the "0" reference group for the "name" variable? Thank you.

It would be much easier if you provided a full reproducible example with complete code, including the glm calls and the different outputs. E.g., have you specified the family? Check ?family - Roman
Hi Roman, thank you for the answer. Yes, I specified the family when computing the model, i.e. glm(y ~ x, family = binomial, data = df). - WannabeGandalf
Here you can find some instructions for reproducible examples: stackoverflow.com/questions/5963269/… - desval
Does this answer your question? Logistic regression - defining reference level in R - Roman

1 Answer


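By default, R orders the levels of a factor alphabetically, so CI comes before CON and becomes the first (reference) level. You can set your own order with:

df$Group2 <- factor(df$Group2, levels=c('CON','CI'))

With a factor response, glm with family = binomial treats the first level as "failure" and every other level as "success". After this change, CON is the reference level and the model predicts CI, so you should get the same results as with the 0/1 coding.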
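To see why the two codings give different odds ratios, here is a minimal sketch on simulated data. The predictor x, the sample size, and the effect sizes are made-up assumptions, not the asker's data; only the names Group and Group2 and their levels follow the question. It suggests that swapping the reference level flips the sign of every coefficient, so each odds ratio becomes its reciprocal, and that relevel() (a base R alternative to the factor() call above) brings the two models back into agreement.

```r
# Simulated data: x, n and the effect sizes are illustrative assumptions.
set.seed(1)
n <- 200
x <- rnorm(n)
grp_num <- rbinom(n, 1, plogis(0.5 + 0.8 * x))  # 1 = CI, 0 = CON

df <- data.frame(
  x      = x,
  Group  = factor(grp_num),                           # levels "0", "1"
  Group2 = factor(ifelse(grp_num == 1, "CI", "CON"))  # levels "CI", "CON" (alphabetical)
)

# With a factor response, the first level is "failure", the rest "success".
fit_num  <- glm(Group  ~ x, family = binomial, data = df)  # models P(Group == "1"), i.e. CI
fit_name <- glm(Group2 ~ x, family = binomial, data = df)  # models P(Group2 == "CON")

# Coefficients have opposite signs, so the odds ratios are reciprocals.
or_num  <- exp(coef(fit_num))
or_name <- exp(coef(fit_name))
print(cbind(or_num, or_name, product = or_num * or_name))

# Make CON the first level so that CI is the modelled outcome again.
df$Group2 <- relevel(df$Group2, ref = "CON")
fit_releveled <- glm(Group2 ~ x, family = binomial, data = df)
or_releveled  <- exp(coef(fit_releveled))
print(cbind(or_num, or_releveled))
```

If this carries over to the real data, neither variable is wrong: an OR above 1 for one coding shows up as 1/OR below 1 for the other, which would explain why the plotted values look "bigger" for one variable and "smaller" for the other.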