2
votes

I am trying to understand the difference between two different fitting methods for a data set with a bounded response variable. The response variable is a fraction and therefore has a range of [0,1]. I have uncovered through my Google searching that there are a lot of different methods out there as this is a common operation. I am currently interested in the difference between the stock R GLM fit and the Beta regression offered in the betareg package. I am using the GasolineYield data set from the "betareg" package as my sample data set. Before I post the code and the results my two questions are the following:

  1. Am I performing the Logistic Regression fit in R using the builtin R GLM correctly?

  2. Why are the standard errors reported in the Beta regression so much smaller than the standard errors for the R logistic regression?

R Setup Code

library(betareg)
data("GasolineYield", package = "betareg")

Beta Regression code from the "betareg" package

gy = betareg(yield ~ batch + temp, data = GasolineYield)
summary(gy)

Beta Regression summary output

Call:
betareg(formula = yield ~ batch + temp, data = GasolineYield)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-2.8750 -0.8149  0.1601  0.8384  2.0483 

Coefficients (mean model with logit link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.1595710  0.1823247 -33.784  < 2e-16 ***
batch1       1.7277289  0.1012294  17.067  < 2e-16 ***
batch2       1.3225969  0.1179020  11.218  < 2e-16 ***
batch3       1.5723099  0.1161045  13.542  < 2e-16 ***
batch4       1.0597141  0.1023598  10.353  < 2e-16 ***
batch5       1.1337518  0.1035232  10.952  < 2e-16 ***
batch6       1.0401618  0.1060365   9.809  < 2e-16 ***
batch7       0.5436922  0.1091275   4.982 6.29e-07 ***
batch8       0.4959007  0.1089257   4.553 5.30e-06 ***
batch9       0.3857930  0.1185933   3.253  0.00114 ** 
temp         0.0109669  0.0004126  26.577  < 2e-16 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)    440.3      110.0   4.002 6.29e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood:  84.8 on 12 Df
Pseudo R-squared: 0.9617
Number of iterations: 51 (BFGS) + 3 (Fisher scoring) 

R GLM Logistic Regression code from stock R

glmfit = glm(yield ~ batch + temp, data = GasolineYield, family = "binomial")
summary(glmfit)

R GLM Logistic Regression summary output

Call:
glm(formula = yield ~ batch + temp, family = "binomial", data = GasolineYield)

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.100459  -0.025272   0.004217   0.032879   0.082113  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.130227   3.831798  -1.600    0.110
batch1       1.720311   2.127205   0.809    0.419
batch2       1.305746   2.481266   0.526    0.599
batch3       1.562343   2.440712   0.640    0.522
batch4       1.048928   2.152385   0.487    0.626
batch5       1.125075   2.176242   0.517    0.605
batch6       1.029601   2.229773   0.462    0.644
batch7       0.540401   2.294474   0.236    0.814
batch8       0.497355   2.288564   0.217    0.828
batch9       0.378315   2.494881   0.152    0.879
temp         0.010906   0.008676   1.257    0.209

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.34184  on 31  degrees of freedom
Residual deviance: 0.07046  on 21  degrees of freedom
AIC: 36.631

Number of Fisher Scoring iterations: 5
1
You might want to try Cross Validated.Nicholas Flees
So you're just going to ignore that warning: Warning message: In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!Dason
Of course not. I just wanted to ask the obvious questions first (although that is also pretty obvious). I have recently determined that I have feed the data incorrectly to R for the GLM fit. As you see I did not specify the weights. Unfortunately, the data set does not have the number of events for each unique combination of factors. This may be the entire problem. But I wanted to see what other people said. I will be posting a similar (but different) question on Cross Validated...Justace Clutter
Yes - CrossValidated would be better suited for this question. Having a good understanding of the actual model you are fitting is important and the beta regression and logistic regression make different assumptions about the data.Dason
If you want to ask this at CrossValidated, flag it and ask for it to be moved.Glen_b

1 Answers

2
votes

The standard errors are different because the variance assumptions in the two models are different.

Logistic regression assumes the response has a binomial distribution, while beta regression assumes it has a beta distribution.

The variance functions of the two are different. For the binomial, if you specify the mean (and $n$ is a given) the variance is determined. For the beta there's another free parameter, so it isn't determined by the mean and would presumably be estimated from the data.

This suggests that if you fit a quasibinomial GLM (adding a variance parameter) you might get closer to the same standard errors, but they still won't be the same, since they would weight the observations differently.

What you should actually do:

  • if your proportions are originally counts divided by some total count, then a binomial GLM would be an appropriate model to consider. (You would need the total counts, though.)

  • if your proportions are continuous fractions (the proportion of milk that's cream for example), then beta regression is an appropriate model to consider.