2
votes

I was running a logistic regression on the following data frame and got reasonable results (the same as STATA's). But the Pearson chi-square and degrees of freedom I got from R are very different from STATA's, which in turn gave me a very small p-value. I also cannot get the area under the ROC curve. Could anyone help me figure out why residuals() does not give the expected result on a glm() with prior weights, and how to deal with the area under the ROC curve?

Following is my code and output.

1. Data

Here is my data frame test_data; y is the outcome, x1 and x2 are covariates:

 y     x1        x2     freq
 0     No        0      268
 0     No        1       14
 0    Yes        0      109
 0    Yes        1        1
 1     No        0       31
 1     No        1        6
 1    Yes        0       45
 1    Yes        1        6

I generated this data frame from the original data by counting the occurrences of each covariate pattern and storing the counts in the new variable freq.
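For reproducibility, the table above can be constructed directly in R (a sketch; the column names follow the question):

```r
# Rebuild the frequency table shown above: one row per (y, x1, x2)
# combination, with freq holding the count of subjects in that cell
test_data <- data.frame(
  y    = rep(c(0, 1), each = 4),
  x1   = rep(c("No", "No", "Yes", "Yes"), times = 2),
  x2   = rep(c(0, 1, 0, 1), times = 2),
  freq = c(268, 14, 109, 1, 31, 6, 45, 6)
)
```

Starting from raw one-row-per-subject data, the same table could be produced with, e.g., aggregate() or dplyr::count().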

2. GLM Model

Then I did the logistic regression as:

logit = glm(y ~ x1 + x2, data = test_data, family = binomial, weights = freq)

Output shows:

Deviance Residuals:
      1        2        3        4        5        6        7        8
 -7.501   -3.536   -8.818   -1.521   11.957    3.501   10.409    2.129

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2010     0.1892 -11.632  < 2e-16 ***
x1            1.3538     0.2516   5.381 7.39e-08 ***
x2            1.6261     0.4313   3.770 0.000163 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 457.35  on 7  degrees of freedom
Residual deviance: 416.96  on 5  degrees of freedom
AIC: 422.96

Number of Fisher Scoring iterations: 5

The coefficients are the same as STATA's.

3. Chi-Square Statistic

When I tried to calculate the Pearson chi-square:

pearson_chisq = sum(residuals(logit, type = "pearson", weights=test_data$freq)^2)

I got 488 instead of the 1.3 given by STATA. Also, the degrees of freedom from R are chisq_dof = df.residual(logit) = 5 instead of 1, so I got an extremely small p-value (~e^-100).

4. Discrimination

Then I calculated the area under the ROC curve:

library(verification)
logit_mf = model.frame(logit)
roc.area(logit_mf$y, fitted(logit))$A

The output is:

[1] 0.5

Warning message:
In wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great") :
  cannot compute exact p-value with ties

Thanks!

1
I think this is not so much a coding question as a request for statistical education and interpretation, and as such belongs on CrossValidated.com. The warning message is a standard warning for the Wilcoxon tests. The p-values are usually asymptotically correct and it can usually be ignored. You would expect a large number of ties with such grouped data. – IRTFM
Hi BondedDush, thanks for your reply. I anticipated many ties, but R does not seem capable of handling logistic regression with a frequency table the way STATA does. Is there a way to make R produce such diagnostics as STATA does? – J.Liu
I have no idea what you mean by "R seems not capable of handling logistic regression using frequency table". You already said the coefficients were the same. If you don't know how to handle the inferential tasks, that's not R's fault but reflects your limitations. – IRTFM
You simply specified your glm model wrong. You can specify it in two different ways. One follows here: total <- with(test_data, freq[1:4] + freq[5:8]); logit = glm(freq/total ~ x1 + x2, data = test_data[1:4,], family = binomial, weights = total); chi_test <- sum(residuals(logit, type = "pearson")^2) – J.R.
@J.R. This is what I was looking for. I did not know the meaning of weights and should have extracted the covariate patterns before. Thanks! – J.Liu

1 Answer

0
votes

I eventually figured out how to solve this problem. The data set above should be summarised into covariate patterns; the Pearson chi-square can then be computed from its definition. The R code follows:

# extract covariate patterns
library(dplyr)
temp = test_data %>% group_by(x1, x2) %>% summarise(m = sum(freq), y = sum(freq*y))
temp$pred = fitted(logit)[1:4]

# calculate Pearson chi square
temp = mutate(temp, pearson = (y - m*pred)/sqrt(m*pred*(1 - pred)))
pearson_chi2 = with(temp, sum(pearson^2))
temp_dof = 4 - (2 + 1)  # dof = J - (p + 1)

# calculate p-value
pchisq(pearson_chi2, temp_dof, lower.tail = F)

The resulting p-value is 0.241941, the same as STATA's.
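As a cross-check, the same statistic falls out of the grouped (events/trials) specification mentioned in the comments: fit one row per covariate pattern with the number of trials as prior weights, and the Pearson residuals then refer to the J = 4 patterns directly. A sketch, assuming test_data is ordered as in the question (rows 1–4 are the y = 0 counts, rows 5–8 the y = 1 counts):

```r
# test_data as shown in the question
test_data <- data.frame(
  y    = rep(c(0, 1), each = 4),
  x1   = rep(c("No", "No", "Yes", "Yes"), times = 2),
  x2   = rep(c(0, 1, 0, 1), times = 2),
  freq = c(268, 14, 109, 1, 31, 6, 45, 6)
)

# Trials per covariate pattern (y = 0 count + y = 1 count)
total <- test_data$freq[1:4] + test_data$freq[5:8]

# Model the success proportion per pattern; rows 5:8 carry the y = 1
# counts, so freq/total is the observed proportion of successes
logit_grp <- glm(freq/total ~ x1 + x2, data = test_data[5:8, ],
                 family = binomial, weights = total)

# Pearson chi-square over J = 4 patterns with p + 1 = 3 parameters: df = 1
pearson_chi2 <- sum(residuals(logit_grp, type = "pearson")^2)
p_val <- pchisq(pearson_chi2, df = 1, lower.tail = FALSE)
p_val  # ~0.24, matching the covariate-pattern calculation above
```

The coefficients from this grouped fit are identical to the weighted fit in the question; only the residuals (and hence the chi-square and its degrees of freedom) change their reference level.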

In order to calculate the AUC, we should first expand the covariate patterns back to the "original" data, then use the expanded data to get the AUC. Note that we have 392 "0"s and 88 "1"s in the frequency table. My code follows:

# expand observations
y_expand = c(rep(0, 392), rep(1, 88))

# expand predictions
logit_mf = model.frame(logit)
logit_mf$pred = fitted(logit)
logit_mf$freq = test_data$freq
yhat_expand = with(logit_mf, rep(pred, freq))

library(verification)
roc.area(y_expand, yhat_expand)$A

AUC = 0.6760, the same as STATA's.
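The hard-coded counts (392 and 88) can also be avoided: rep() on the freq column expands the outcomes and the fitted values in the same row order. A base-R sketch, assuming the data and weighted fit from the question (the rank formula below is the Mann–Whitney form of the AUC, which is what roc.area computes, so the verification package is not required):

```r
# Data and weighted fit as in the question
test_data <- data.frame(
  y    = rep(c(0, 1), each = 4),
  x1   = rep(c("No", "No", "Yes", "Yes"), times = 2),
  x2   = rep(c(0, 1, 0, 1), times = 2),
  freq = c(268, 14, 109, 1, 31, 6, 45, 6)
)
logit <- glm(y ~ x1 + x2, data = test_data, family = binomial, weights = freq)

# Expand outcomes and predictions by the pattern frequencies
y_expand    <- rep(test_data$y, test_data$freq)
yhat_expand <- rep(fitted(logit), test_data$freq)

# AUC via the Mann-Whitney rank formula (midranks handle ties)
r   <- rank(yhat_expand)
n1  <- sum(y_expand == 1)
n0  <- sum(y_expand == 0)
auc <- (sum(r[y_expand == 1]) - n1 * (n1 + 1) / 2) / (n0 * n1)
auc  # ~0.676, matching roc.area and STATA
```

Because both vectors are expanded from the same rows of test_data, the outcome and prediction stay aligned regardless of how the table is sorted.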