I was implement logistic regression to the following data frame and got a reasonable (the same as using STATA) results. But the Pearson chi square and degree of freedom I got from R is very different from STATA, which in turn gave me an very small p-value. And I cannot get the area under ROC curve. Could anyone help me to find out why residual() does not work on glm() with priori weights, and how to deal with area under ROC curve?
Following is my code and output.
1. Data
Here is my data frame test_data
, y is outcome, x1 and x2 are covariates:
y x1 x2 freq 0 No 0 268 0 No 1 14 0 Yes 0 109 0 Yes 1 1 1 No 0 31 1 No 1 6 1 Yes 0 45 1 Yes 1 6
I generated this data frame from the original data by counting occurrence of each covariate pattern, and store the number in new variable freq
.
2. GLM Model
Then I did the logistic regression as:
logit=glm(y~x1+x2, data=test_data, family=binomial, weights=freq)
Output shows:
Deviance Residuals: 1 2 3 4 5 6 7 8
-7.501 -3.536 -8.818 -1.521 11.957 3.501 10.409 2.129Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2010 0.1892 -11.632 < 2e-16 ***x1 1.3538 0.2516 5.381 7.39e-08 ***
x2 1.6261 0.4313 3.770 0.000163 ***
Signif. codes: 0 '' 0.001 '' 0.01 '' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 457.35 on 7 degrees of freedom
Residual deviance: 416.96 on 5 degrees of freedom AIC: 422.96
Number of Fisher Scoring iterations: 5
Coefficients are the same as STATA.
3. Chi Square Statistics
when I tried to calculate the Pearson chi square:
pearson_chisq = sum(residuals(logit, type = "pearson", weights=test_data$freq)^2)
I got 488, instead of 1.3 given by STATA. Also the DOF generated by R is chisq_dof = df.residuals(logit)
=5, instead of 1. So I got an extremely small p-value~e^-100.
4. Discrimination
Then I calculated the area under ROC curve as:
library(verification)
logit_mf = model.frame(logit)
roc.area(logit_mf $y, fitted(logit))$A
The output is:
[1] 0.5 Warning message:
In wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great") : cannot compute exact p-value with ties
Thanks!
total<-with(test_data,freq[1:4]+freq[5:8]); logit=glm(freq/total~x1+x2, data=test_data[1:4,], family=binomial, weights=total); chi_test<-sum(residuals(logit, type = "pearson")^2);
– J.R.