formula or matrix underlying count predictions using predprob in pscl

Question

The pscl package in R is used to fit, among other models, zero-inflated Poisson models.

I have a data frame on weekly food consumption, which is zero-inflated. I had no issues fitting the zero-inflated Poisson (ZIP) model, which successfully predicts the percentage of zero counts I have in my data. The data contains about 30% zero counts, with just over 4600 observations.

The ZIP model was:

 zeroinfl(LightC.home~SEX+AGEDET+Inhabitants, data=FoodAnalysis)

Where LightC is the weekly consumption count (range 0 - 74), SEX is respondent sex (factor with 2 levels), AGEDET is respondent ageband (factor with 12 levels, e.g. "5-9 years"), and Inhabitants is the number of people in the household (integer, range 1 to 8).

The ZIP model summary is:

Call:
zeroinfl(formula = LightC.home ~ SEX + AGEDET + Inhabitants, data = FoodAnalysis)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-3.3479 -1.2781 -0.5170  0.6342 12.2912 

Count model coefficients (poisson with log link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             2.133213   0.029229  72.983  < 2e-16 ***
SEXFemale              -0.079156   0.010284  -7.697 1.39e-14 ***
AGEDET5-9 years        -0.038153   0.032745  -1.165 0.243963    
AGEDET10-14 years       0.014199   0.032411   0.438 0.661317    
AGEDET15-17 years       0.253942   0.035419   7.170 7.52e-13 ***
AGEDET18-24 years       0.148395   0.029089   5.101 3.37e-07 ***
AGEDET25-34 years       0.158506   0.026291   6.029 1.65e-09 ***
AGEDET35-44 years       0.157821   0.026043   6.060 1.36e-09 ***
AGEDET45-54 years       0.307299   0.026153  11.750  < 2e-16 ***
AGEDET55-64 years       0.340590   0.026913  12.655  < 2e-16 ***
AGEDET65-74 years       0.361976   0.027260  13.278  < 2e-16 ***
AGEDET75-84 years       0.251614   0.039866   6.311 2.76e-10 ***
AGEDET85 years or more  0.606829   0.083345   7.281 3.32e-13 ***
Inhabitants             0.014697   0.004302   3.416 0.000635 ***

Zero-inflation model coefficients (binomial with logit link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -0.82861    0.17389  -4.765 1.89e-06 ***
SEXFemale               -0.08333    0.07196  -1.158 0.246893    
AGEDET5-9 years         -0.27280    0.18082  -1.509 0.131390    
AGEDET10-14 years       -0.32820    0.18390  -1.785 0.074308 .  
AGEDET15-17 years       -0.06472    0.20589  -0.314 0.753276    
AGEDET18-24 years       -0.06973    0.16047  -0.435 0.663896    
AGEDET25-34 years       -0.27054    0.14791  -1.829 0.067385 .  
AGEDET35-44 years       -0.71412    0.15648  -4.564 5.03e-06 ***
AGEDET45-54 years       -0.50510    0.15585  -3.241 0.001191 ** 
AGEDET55-64 years       -0.65281    0.16541  -3.947 7.92e-05 ***
AGEDET65-74 years       -1.09276    0.18313  -5.967 2.41e-09 ***
AGEDET75-84 years       -1.28092    0.34508  -3.712 0.000206 ***
AGEDET85 years or more -12.75039  279.21031  -0.046 0.963577    
Inhabitants              0.01733    0.02872   0.603 0.546296    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 35 
Log-likelihood: -2.136e+04 on 28 Df

I used the predprob command in the pscl package to construct the vector of probabilities of consumption for the food. That constructed a matrix, one row per observation, with each observation given a probability of consumption for each count between 0 (minimum) and 74 (maximum). There are 165 unique rows in this matrix, which are due to the 165 unique combinations of predictors in the zero-inflated Poisson model.

How can I construct the formula underpinning those 75 probabilities for each combination of predictors (e.g. female, aged 25-34 years, 4 inhabitants)? I don't understand the relationship between the 75 probabilities from a mathematical perspective.

For example, the predictor combination SEX=="Female", AGEDET=="25-34 years" and Inhabitants == 4, gives these probabilities for each person with that predictor combination in the data (only first 6 of 75 shown):

        X0            X1           X2           X3          X4          X5
 0.2473285  0.0004504804  0.002183136  0.007053337  0.01709108  0.03313101

How were those probabilities estimated, given that the original zeroinfl formula looks like it's simply estimating the mean value of consumption for the combination of predictors and not all 75 probabilities?

Achim Zeileis Achim Zeileis · Accepted Answer · 2015-10-02T21:24:30

The zero-inflated Poisson model is a mixture of a point mass at zero and a Poisson distribution. See Section 2.3 of vignette("countreg", package = "pscl") for the exact formulas.

In your model there are two linear predictors: one for the mean of the Poisson component (with log link) and one for the zero-inflation probability (with logit link). Using the particular regressor vector indicated in your question we can compute both linear predictors and then apply the respective inverse link functions to obtain the mean mu and the zero-inflation probability pr:

mu <- exp(2.133213 -0.079156 + 0.158506 + 0.014697 * 4)
mu
## [1] 9.692487
pr <- plogis(-0.82861 -0.08333 -0.27054 + 0.01733 * 4)
pr
## [1] 0.2472822

And then we can enter these into the mixture density (see Equation 7 in the vignette:

y <- 0:5
pr * I(y == 0) + (1-pr) * dpois(y, mu)
## [1] 0.2473287134 0.0004504784 0.0021831278 0.0070533124 0.0170910338
## [6] 0.0331309230

formula or matrix underlying count predictions using predprob in pscl

1 Answers