I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004), based on 1993 values of gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993). My sample size is n = 7896. All variables are binary, coded 0 for false and 1 for true, except BMI, which is continuous. For gender, male = 1 and female = 0.

When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I quite often get values greater than one, even for very typical individuals. I apologize for the large code block, but I thought the extra information might be helpful.

library(ResourceSelection)
library(MASS)
data=read.csv(file.choose())
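
# Code the outcome and the binary predictors as factors; BMI stays numeric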
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)

logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")

summary(logistic)

hoslem.test(logistic$y, fitted(logistic))

confint(logistic)

min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)

e=2.71828

The output is as follows:

Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + 
    HT1993, family = "binomial", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0362  -1.0513  -0.7831   1.1844   1.8807  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.346104   0.158043 -14.845  < 2e-16 ***
Gen1        -0.748286   0.048398 -15.461  < 2e-16 ***
Smoke19931  -0.059342   0.064606  -0.919    0.358    
BMI1993      0.084056   0.006005  13.997  < 2e-16 ***
HC19931      0.388217   0.047820   8.118 4.72e-16 ***
HT19931      0.341375   0.058423   5.843 5.12e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10890  on 7895  degrees of freedom
Residual deviance: 10309  on 7890  degrees of freedom
AIC: 10321

Number of Fisher Scoring iterations: 4

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913

Waiting for profiling to be done...
                  2.5 %      97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1        -0.84336906 -0.65364134
Smoke19931  -0.18619647  0.06709748
BMI1993      0.07233866  0.09588198
HC19931      0.29454661  0.48200673
HT19931      0.22690608  0.45595006

[1] 18
[1] 26
[1] 43

A non-smoking female with the median BMI (26), hypertension, and high cholesterol yields the following:

e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)

[1] 1.7664

I think the issue is somehow related to BMI, since it is the only numeric variable. Does anyone know why this regression produces probabilities greater than 1?

e^x is not the right inverse for the transformation in logistic regression. The transformation uses the logit function, whose inverse is the expit function: e^x / (1 + e^x). This seems to be a statistical misunderstanding rather than a programming problem. Also, it's usually much easier to use the built-in predict() function to calculate predicted values rather than doing it by hand like this in R. – MrFlick

Thank you! That would explain it. – Seth Kasten
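
To make that concrete, here is a minimal sketch that applies the inverse-logit to the hand-built linear predictor, reusing the coefficients from the summary output above (plogis() is base R's inverse-logit):

# Linear predictor for a non-smoking female with BMI 26,
# hypertension, and high cholesterol (coefficients from summary(logistic))
lp <- -2.346104 + 26*0.084056 + 1*0.388217 + 1*0.341375

exp(lp)       # odds, about 1.77 -- this is the value computed in the question
plogis(lp)    # probability = exp(lp) / (1 + exp(lp)), about 0.64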

1 Answer

By default, family = "binomial" uses the logit link function (see ?family). So the probability you're looking for is 1.7664 / (1 + 1.7664) ≈ 0.64.
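
As a sketch of the predict() route suggested in the comments, assuming the fitted logistic object from the question is still in the workspace and the 0/1 factor coding described there:

# Converting the odds by hand
1.7664 / (1 + 1.7664)   # about 0.64

# Letting predict() handle the link; type = "response" returns probabilities
newdata <- data.frame(Gen       = factor(0, levels = c(0, 1)),
                      Smoke1993 = factor(0, levels = c(0, 1)),
                      BMI1993   = 26,
                      HC1993    = factor(1, levels = c(0, 1)),
                      HT1993    = factor(1, levels = c(0, 1)))
predict(logistic, newdata = newdata, type = "response")   # about 0.64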