2
votes

I am replicating a negative binomial regression model in R. When calculating robust standard errors, the output does not match Stata output of standard errors.

The original Stata code is

nbreg displaced  eei lcostofwar cfughh roadskm lpopdensity ltkilled, robust nolog

I have attempted both manual calculation and vcovHC from sandwich. However, neither produces the same results.

My regression model is as follows: mod1 <- glm.nb(displaced ~ eei + costofwar_log + cfughh + roadskm + popdensity_log + tkilled_log, data = mod1_df)

With vcovHC I have tried every option from HC0 to HC5. Attempt 1:

cov_m1 <- vcovHC(mod1, type = "HC0", sandwich = T)
se <- sqrt(diag(cov_m1))

Attempt 2:

mod1_rob <- coeftest(mod1, vcovHC = vcov(mod1, type = "HC0"))

The most successful has been HC0 and vcov = sandwich but no SEs are correct.

Any suggestions?

EDIT

My output is as follows (using HC0):

                 Estimate Std. Error z value  Pr(>|z|)    
(Intercept)     1.3281183  1.5441312  0.8601  0.389730    
eei            -0.0435529  0.0183359 -2.3753  0.017536 *  
costofwar_log   0.2984376  0.1350518  2.2098  0.027119 *  
cfughh         -0.0380690  0.0130254 -2.9227  0.003470 ** 
roadskm         0.0020812  0.0010864  1.9156  0.055421 .  
popdensity_log -0.4661079  0.1748682 -2.6655  0.007688 ** 
tkilled_log     1.0949084  0.2159161  5.0710 3.958e-07 ***

The Stata output I am attempting to replicate is:

                 Estimate Std. Error    
(Intercept)     1.328     1.272
eei            -0.044     0.015 
costofwar_log   0.298     0.123  
cfughh         -0.038     0.018 
roadskm         0.002     0.0001   
popdensity_log -0.466     0.208 
tkilled_log     1.095     0.209  

The dataset is found here and the recoded variables are:

mod1_df <- table %>% 
  select(displaced, eei_01, costofwar, cfughh, roadskm, popdensity, 
tkilled)
mod1_df$popdensity_log <- log(mod1_df$popdensity + 1)
mod1_df$tkilled_log <- log(mod1_df$tkilled + 1)
mod1_df$costofwar_log <- log(mod1_df$costofwar + 1)
mod1_df$eei <- mod1_df$eei_01*100
2
Stata applies a degree of freedom correction that some R packages don't by default. That most likely causes the difference. As to your question; please include a minimal example and the Stata code you are running, othervise it is difficult for us to see what you are doingSAFEX
would be great to see the exact regression command that produce the Stata table as well. And also provide a dataset which we can useSAFEX

2 Answers

3
votes

Stata uses the observed Hessian for its computations, glm.nb() uses the expected Hessian. Therefore, the default bread() employed by the sandwich() function is different, leading to different results. There are other R packages that employ the observed hessian for its variance-covariance estimate (e.g., gamlss) but these do not supply an estfun() method for the sandwich package.

Hence, below I simply set up a dedicated bread_obs() function that extracts the ML estimates from a negbin object, sets up the negative log-likelihood, computes the observed Hessian numerically via numDeriv::hessian() and computes the "bread" from it (omitting the estimate for log(theta)):

bread_obs <- function(object, method = "BFGS", maxit = 5000, reltol = 1e-12, ...) {
  ## data and estimated parameters
  Y <- model.response(model.frame(object))
  X <- model.matrix(object)
  par <- c(coef(object), "log(theta)" = log(object$theta))

  ## dimensions
  n <- NROW(X)
  k <- length(par)

  ## nb log-likelihood
  nll <- function(par) suppressWarnings(-sum(dnbinom(Y,
    mu = as.vector(exp(X %*% head(par, -1))),
    size = exp(tail(par, 1)), log = TRUE)))

  ## covariance based on observed Hessian
  rval <- numDeriv::hessian(nll, par) 
  rval <- solve(rval) * n
  rval[-k, -k]
}

With that function I can compare the sandwich() output (based on the expected Hessian) with the output using the bread_obs() (based on the observed Hessian).

s_exp <- sandwich(mod1)
s_obs <- sandwich(mod1, vcov = bread_obs)
cbind("Coef" = coef(mod1), "SE (Exp)" = sqrt(diag(s_exp)), "SE (Obs)" = sqrt(diag(s_obs)))
##                  Coef SE (Exp) SE (Obs)
## (Intercept)     1.328    1.259    1.259
## eei            -0.044    0.017    0.015
## costofwar_log   0.298    0.160    0.121
## cfughh         -0.038    0.015    0.018
## roadskm         0.002    0.001    0.001
## popdensity_log -0.466    0.135    0.207
## tkilled_log     1.095    0.179    0.208

This still has slight differences compared to Stata but these are likely numerical differences from the optimization etc.

If you create a new dedicated bread() method for negbin objects

bread.negbin <- bread_obs

then the method dispatch will use this if you do sandwich(mod1).

2
votes

In R you need to manually provide a degree of freedom correction, so try this which I borrowed from this source:

dfa <- (G/(G - 1)) * (N - 1)/pm1$df.residual

# display with cluster VCE and df-adjustment
firm_c_vcov <- dfa * vcovHC(pm1, type = "HC0", cluster = "group", adjust = T)
coeftest(pm1, vcov = firm_c_vcov)

Here G is the number of Panels in your data set, N is the number of observations and pm1 is your model estimated. Obviously, you could drop the clustering.