
I have been comparing Poisson, negative binomial (NB), and zero-inflated Poisson and NB models in R. My dependent variable is a symptom count for generalized anxiety disorder (GAD), and my predictors are two personality traits (disinhibition [ZDis_winz] and meanness [ZMean_winz]), their interaction, and covariates of age and assessment site (dummy-coded; there are 8 sites, so I have 7 dummy variables). I have a sample of 1206 participants with complete data (these are the only individuals included in the data frame).

I am using NB models for this disorder because the variance (~40) far exceeds the mean (~4). I wanted to consider the possibility of a ZINB model as well, given that ~30% of the sample has 0 symptoms.
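For reference, these figures are easy to confirm directly; a minimal check, assuming the data frame and outcome column are named as in the models below:

# Overdispersion: the variance (~40) far exceeds the mean (~4)
mean(eurodata$dawbac_bl_GAD_sxs_uw)
var(eurodata$dawbac_bl_GAD_sxs_uw)

# Zero inflation: proportion of the sample with 0 symptoms (~0.30)
mean(eurodata$dawbac_bl_GAD_sxs_uw == 0)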

For other symptom counts (e.g., conduct disorder), I have run ZINB models perfectly fine in R, but I am getting an error when I do the exact same thing with the GAD model. The standard NB model works fine for GAD; it is only the GAD ZINB model that's erroring out.

Here is the error I'm receiving:

Error in solve.default(as.matrix(fit$hessian)) : system is computationally singular: reciprocal condition number = 4.80021e-36

Here is the code I'm using for the (working) NB model:

library(MASS)  # provides glm.nb()

summary(
    NB_GAD_uw_int <- glm.nb(
        # ZMean_winz * ZDis_winz expands to both main effects plus their interaction
        dawbac_bl_GAD_sxs_uw ~ ZMean_winz * ZDis_winz + age_years +
            Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy +
            Mannheim_dummy + Paris_dummy + Dresden_dummy,
        data = eurodata))

Here is the code I'm using for the (not working) ZINB model (which is identical to other ZINB models I've run for other disorders):

library(pscl)  # provides zeroinfl()

summary(
    ZINB_GAD_uw_int <- zeroinfl(
        dawbac_bl_GAD_sxs_uw ~ ZMean_winz * ZDis_winz + age_years +
            Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy +
            Mannheim_dummy + Paris_dummy + Dresden_dummy,
        data = eurodata,
        dist = "negbin",
        model = TRUE,
        y = TRUE, x = TRUE))

I have seen a few other posts on Stack Overflow and other forums about this type of issue. As far as I can tell, the usual explanations are either 1) collinear predictors or 2) a model that is too complex for too little data. (Please let me know if I am misinterpreting this! I'm fairly new to Poisson-based models.) However, I am still confused by these answers, because:

1) None of my predictors are correlated above .15, except for the two main predictors of interest (ZMean_winz and ZDis_winz), which are correlated at about .45, and the same predictors are used in other ZINB models that have worked.

2) With 1206 participants, and having run the same ZINB model on similarly distributed count data for other disorders, I don't see how the model could be too complex for my data.
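For what it's worth, both suspicions can be checked directly on the design matrix the model would use, with base R's kappa() (condition number) and qr() (rank); a quick sketch using the variable names from the models above:

# Design matrix for the count model, built the same way glm.nb/zeroinfl would
X <- model.matrix(
    ~ ZMean_winz * ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy +
      Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
    data = eurodata)

kappa(X)               # condition number; very large values indicate near-collinearity
qr(X)$rank == ncol(X)  # FALSE means the matrix is rank-deficient (singular)

# Note: by default zeroinfl() uses these same regressors in BOTH the count and
# the zero-inflation component, roughly doubling the number of parameters.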

If anyone has any explanation for why this version of my model will not run and/or any suggestions for troubleshooting, I would really appreciate it! I am also happy to provide more info if needed.

Thank you so much!

Have you confirmed that you do, in fact, have counts of 0 and counts >0? Have you also checked that your x-variables do not perfectly predict the y-variable outcome? – DanY
"1) collinear predictors or 2) too complex a model for too little data", these are issues for any kind of regression, not just poisson. Have you tried re-running the regression with a simpler model? I'd start by removing the dummy variables.AkselA
@DanY Thank you for your advice. Yes, I checked those things. The model is actually not intended to explain much of the variance in the outcome (it's for discriminant validity; the focus of the project is traits that contribute to conduct disorder symptoms), so it definitely doesn't perfectly explain the outcome. I appreciate your help! – E. Perkins
@AkselA Thank you, you're right. It worked when I took out all the dummy variables; I'm just not totally sure why that improves it for only this particular outcome variable and model type. Any advice in that regard? Also, my goal is to show across outcome variables (different disorders) that these personality traits specifically affect conduct disorder, and less so GAD, so I'm hoping to keep the models consistent across disorders. The others included all the predictors, including the dummy codes. I might try reducing the dummy codes by coding for country instead in each model. Thank you!! – E. Perkins
@E.Perkins Are you sure you've "dummified" the categorical "city" variable correctly? You can confirm by comparing with the output of model.matrix(...). I'm not sure about zeroinfl, but glm.nb will dummify categorical variables internally, so there is no need to do that manually. – Maurits Evers

1 Answer


The problem may be that zeroinfl is not converting categorical variables into dummy variables.

You can dummify your variables using model.matrix, which is what glm, glm.nb, etc. call internally to dummify categorical variables. Letting model.matrix do this is usually preferable to dummifying by hand: it avoids transcription mistakes and ensures that your model matrix has full rank (a full-rank matrix is non-singular).

If you do want explicit dummy columns in your data frame, I would still use model.matrix to transform the input data involving categorical variables (and potentially interactions between categorical variables and other variables) into the correct model matrix.

Here is an example:

set.seed(2017)

# Toy data: a numeric outcome, one numeric predictor, and one
# three-level categorical predictor
df <- data.frame(
    DV = rnorm(100),
    IV1_num = rnorm(100),
    IV2_cat = sample(c("catA", "catB", "catC"), 100, replace = TRUE))
head(df)
#           DV     IV1_num IV2_cat
#1  1.43420148  0.01745491    catC
#2 -0.07729196  1.37688667    catC
#3  0.73913723 -0.06869535    catC
#4 -1.75860473  0.84190898    catC
#5 -0.06982523 -0.96624056    catB
#6  0.45190553 -1.96971566    catC


# model.matrix() builds the design matrix, expanding IV2_cat into
# treatment-coded dummy columns (catA is absorbed as the reference level)
mat <- model.matrix(DV ~ IV1_num + IV2_cat, data = df)
head(mat)
#  (Intercept)     IV1_num IV2_catcatB IV2_catcatC
#1           1  0.01745491           0           1
#2           1  1.37688667           0           1
#3           1 -0.06869535           0           1
#4           1  0.84190898           0           1
#5           1 -0.96624056           1           0
#6           1 -1.96971566           0           1

The manually dummified input data would then be

# Drop the intercept column; keep DV plus the dummified predictors
df.dummified <- cbind.data.frame(DV = df$DV, mat[, -1])
head(df.dummified)
#           DV     IV1_num IV2_catcatB IV2_catcatC
#1  1.43420148  0.01745491           0           1
#2 -0.07729196  1.37688667           0           1
#3  0.73913723 -0.06869535           0           1
#4 -1.75860473  0.84190898           0           1
#5 -0.06982523 -0.96624056           1           0
#6  0.45190553 -1.96971566           0           1

which you'd then use in, e.g.,

glm.nb(DV ~ ., data = df.dummified)

(With this toy Gaussian DV the negative binomial fit itself would fail, since glm.nb requires a non-negative count response; the line just illustrates how the dummified columns enter the formula via DV ~ . .)
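
Applied to the question's setup, the cleaner route is to keep assessment site as a single factor and let zeroinfl build the dummies itself. A sketch of that, assuming a hypothetical 8-level factor column named site in eurodata (not shown in the question's data as posted):

library(pscl)

# 'site' is a hypothetical 8-level factor; zeroinfl() will dummify it
# internally via model.matrix, guaranteeing a full-rank design matrix
ZINB_GAD_uw_int2 <- zeroinfl(
    dawbac_bl_GAD_sxs_uw ~ ZMean_winz * ZDis_winz + age_years + site,
    data = eurodata,
    dist = "negbin")
summary(ZINB_GAD_uw_int2)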