I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004) based on gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993) status in 1993. My sample size is n=7896. All variables are binary with 0 and 1 for false and true except BMI, which is continuous numeric. For gender, male=1 and female=0.
When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I get values greater than one quite often for very standard individuals. I apologize for the large code block, but I thought more information may be helpful.
library(ResourceSelection)
library(MASS)
data=read.csv(file.choose())
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)
logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")
summary(logistic)
hoslem.test(logistic$y, fitted(logistic))
confint(logistic)
min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)
e=2.71828
The output is as follows:
Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 +
HT1993, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0362 -1.0513 -0.7831 1.1844 1.8807
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.346104 0.158043 -14.845 < 2e-16 ***
Gen1 -0.748286 0.048398 -15.461 < 2e-16 ***
Smoke19931 -0.059342 0.064606 -0.919 0.358
BMI1993 0.084056 0.006005 13.997 < 2e-16 ***
HC19931 0.388217 0.047820 8.118 4.72e-16 ***
HT19931 0.341375 0.058423 5.843 5.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10890 on 7895 degrees of freedom
Residual deviance: 10309 on 7890 degrees of freedom
AIC: 10321
Number of Fisher Scoring iterations: 4
Hosmer and Lemeshow goodness of fit (GOF) test
data: logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1 -0.84336906 -0.65364134
Smoke19931 -0.18619647 0.06709748
BMI1993 0.07233866 0.09588198
HC19931 0.29454661 0.48200673
HT19931 0.22690608 0.45595006
[1] 18
[1] 26
[1] 43
A non-smoking female w/ median BMI (26), hypertension, and high cholesterol yields the following:
e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)
[1] 1.7664
I think the issue is related somehow to BMI considering that is the only variable that is numeric. Does anyone know why this regression produces probabilities greater than 1?
e^x
is not the right inverse for the transformation in logistic regression. The transformation is with thelogit
function whose inverse is theexpit
function which should bee^x/(1+e^x)
. This seems to be a statistical misunderstanding rather than a programming problem. Also it's usually much easier to use the build inpredict()
function for calculating new values rather than doing it by hand like this in R. – MrFlick