I am fairly new to the R world. I need some help with the correct syntax for running a negative binomial regression/zero-inflated negative binomial regression on some data.

I am trying to run a regression that looks at how race/ethnicity influences the number of summer meal sites in a census tract.

I have parsed all the necessary data into one "MASTER.csv" file and the format is as follows:

Column headers: GEO ID - number of Summer Meal Sites - Census Tract Name - Total Population - White - Black - Indian - Asian - Other

So an example row would look like: 48001950100 - 4 - Census Tract 9501, Anderson County, Texas - 5477 - 4400 - 859 - 14 - 21 - 0

And so on, I have a total of 5266 rows each in the same format. Race/ethnicity is reported as a count of how many individuals in that certain census tract are of a respective race/ethnicity.

I am using a zero-inflated negative binomial model to account for the dependent variable being a "count", and therefore susceptible to skewed distributions.

My dependent variable is the number of summer meal sites in each census tract (e.g., in the example row above, the second column, 4).

My independent variables would be the race/ethnicity counts: Black, White, etc. I also need to set White as my omitted (or reference) variable, since I am running a regression on nominal variables.

How would I go about doing this? Would it look similar to the code posted below?

require(MASS)
require(pscl)

zeroinfl(formula = MASTER$num_summer_meal_sites ~ .| MASTER$White + MASTER$Black +      MASTER$Other, data = "MASTER", dist = "negbin")

Would this do what I need? Also, I am unclear as to how I should set "White" as the reference/omitted variable.

First of all, you should not use MASTER$... in the formula; use, e.g., num_summer_meal_sites ~ . | White + Black + Other. Second, your model doesn't make sense: by specifying, e.g., y ~ . you are implicitly including all columns of MASTER, including GEOID, the county names, and total population. If you want more help, you'll have to provide your data: upload it somewhere and edit your question to provide a link. - jlhoward
If you have a column for each ethnicity (White, Black, Indian, Asian, Other) and you have data by census tract, then ethnicity is not a nominal variable/you don't have a reference category as such (that would apply if modeling at the individual level). However, since the total (White+Black+Indian+Asian+Other) adds up to the population size, these predictors probably will be correlated, so you might want to omit White explicitly. I would suggest formula = num_summer_meal_sites ~ Black + Indian + Asian + Other, data=MASTER - Ben Bolker

1 Answer

As pointed out above, you have a few problems with your formula. It probably should be re-written as

zeroinfl(num_summer_meal_sites ~ Black + Indian + Asian + Other, 
    data = MASTER, dist = "negbin")

Here we specify the data= parameter with the actual data.frame variable, not a character string containing the name of the variable. This allows you to use the names of the columns without having to prefix them all with the data.frame name; variables in the formula are looked up in data= first.
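To see the difference between passing the data.frame and passing its name as a string, here is a small sketch using base R's model.frame(), which modelling functions generally rely on to resolve formula variables. The toy MASTER below is a made-up stand-in (only the first row echoes the example in the question), so treat it as an illustration rather than your actual data.

```r
# Toy stand-in for MASTER; rows after the first are invented for illustration.
MASTER <- data.frame(
  num_summer_meal_sites = c(4, 0, 1),
  Black = c(859, 120, 300)
)

# data = <data.frame>: bare column names in the formula are found in MASTER.
ok <- model.frame(num_summer_meal_sites ~ Black, data = MASTER)

# data = "MASTER" is just a character string, so there are no columns to search.
bad <- tryCatch(
  model.frame(num_summer_meal_sites ~ Black, data = "MASTER"),
  error = function(e) e
)

print(names(ok))               # columns picked up from the data.frame
print(inherits(bad, "error"))  # the string version fails
```

The same lookup is why MASTER$ prefixes are unnecessary once data = MASTER is supplied.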

Also, rather than using "." to indicate all variables, it would be useful in this case to explicitly list the covariates you want, since some of the columns seem inappropriate for regression.
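If you want to check what "." would pull in, terms() will expand it against your data. This is only a sketch: the column names GEOID, tract_name and total_pop are my guesses at how the CSV headers were imported, and the single row is the example from the question.

```r
# One-row stand-in for MASTER; GEOID, tract_name and total_pop are assumed names.
MASTER <- data.frame(
  GEOID = "48001950100",
  num_summer_meal_sites = 4,
  tract_name = "Census Tract 9501, Anderson County, Texas",
  total_pop = 5477,
  White = 4400, Black = 859, Indian = 14, Asian = 21, Other = 0
)

# "." expands to every column of `data` except the response.
dot_terms <- attr(terms(num_summer_meal_sites ~ ., data = MASTER), "term.labels")
print(dot_terms)

# Listing covariates explicitly keeps identifiers and labels out of the model.
explicit_terms <- attr(
  terms(num_summer_meal_sites ~ Black + Indian + Asian + Other),
  "term.labels"
)
print(explicit_terms)
```

With "." the ID and tract name columns end up as predictors, which is almost certainly not what you want.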

And as pointed out above, it's best not to include correlated variables in a regression model. So leaving out White will help to prevent that. Since you have summary data, you don't really have a reference category like you would if you had individual data.
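A rough way to see the problem is to simulate tract counts and inspect the design matrix. Everything below is invented (tract sizes, group shares, the name total_pop), and it assumes the group counts sum exactly to the total, which need not hold in your data (the example row in the question does not add up exactly). Under that assumption, including the total plus all five groups is rank-deficient, and dropping White removes the exact dependence.

```r
set.seed(1)
n <- 500

# Hypothetical tract sizes and group shares, for illustration only.
total_pop <- round(runif(n, 1000, 8000))
shares <- c(White = 0.70, Black = 0.20, Indian = 0.02, Asian = 0.05, Other = 0.03)

# Split each tract's population into five groups; rows sum to total_pop by construction.
counts <- t(sapply(total_pop, function(N) rmultinom(1, N, shares)[, 1]))
colnames(counts) <- names(shares)
sims <- data.frame(total_pop, counts)

# Every group count tracks tract size, so the predictors are strongly correlated.
print(round(cor(sims), 2))

# Total plus all five groups: one column is an exact linear combination of the others.
X_all <- model.matrix(~ total_pop + White + Black + Indian + Asian + Other, data = sims)
# Dropping White removes that exact dependence.
X_drop <- model.matrix(~ total_pop + Black + Indian + Asian + Other, data = sims)

print(c(columns = ncol(X_all), rank = qr(X_all)$rank))
print(c(columns = ncol(X_drop), rank = qr(X_drop)$rank))
```

Note that the remaining counts are still correlated with each other through tract size, so omitting White helps but does not make the predictors independent.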

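zeroinfl uses the | to separate the regressors for the count part (negative binomial here, since dist = "negbin") from those for the zero-inflation part. If you leave out the |, the same regressors are used for both parts. Did you only want to model the inflation with the race variables? If so, your formulation was appropriate.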
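Here is a sketch of how the | changes what gets estimated, fitted to simulated data since I don't have yours. The predictors are made-up values on a small scale (think counts in thousands), and the data-generating parameters are arbitrary; only the formula syntax is the point.

```r
library(pscl)  # provides zeroinfl()

set.seed(42)
n <- 1000

# Hypothetical predictors on a small scale (e.g. counts in thousands).
sim <- data.frame(
  Black  = runif(n, 0, 2),
  Indian = runif(n, 0, 0.5),
  Asian  = runif(n, 0, 0.5),
  Other  = runif(n, 0, 0.5)
)

# Simulated response: negative binomial counts with about 30% structural zeros.
mu <- exp(1.5 + 0.3 * sim$Black)
structural_zero <- rbinom(n, 1, 0.3)
sim$num_summer_meal_sites <- ifelse(structural_zero == 1, 0,
                                    rnbinom(n, size = 2, mu = mu))

# No "|": the same four regressors enter the count part and the zero-inflation part.
fit_same <- zeroinfl(num_summer_meal_sites ~ Black + Indian + Asian + Other,
                     data = sim, dist = "negbin")

# "| 1": regressors in the count part only, intercept-only zero-inflation part.
fit_const <- zeroinfl(num_summer_meal_sites ~ Black + Indian + Asian + Other | 1,
                      data = sim, dist = "negbin")

# Coefficient names are prefixed with count_ or zero_ to show which part they belong to.
print(names(coef(fit_same)))
print(names(coef(fit_const)))
```

Anything to the right of the | goes into the zero-inflation part, so ~ . | White + Black + Other would use every other column in the count part and only those three in the inflation part.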