Correcting for robust/clustered standard errors within the lm function or replacing the results

Question

A while ago, I asked this question, which was about correcting the standard errors when using IV/2SLS and the first stage has a tobit distribution, on which I got an amazing answer from jay.sf (example data at the bottom). He provided me with the following function:

vcov2sls <- function(s1, s2, data, type=2) {
  ## get y names
  y1.nm <- gsub(".*=\\s(.*)(?=\\s~).*", "\\1", deparse(s1$call)[1], perl=TRUE)
  y2.nm <- as.character(s2$terms)[2]
  ## auxilliary model matrix
  X <- cbind(`(Intercept)`=1, data[, y1.nm, F], model.matrix(s2)[,-(1:2)])
  ## get y
  y <- data[, y2.nm] 
  ## betas second stage
  b <- s2$coefficients
  ## calculate corrected sums of squares
  sse <- sum((y - b %*% t(X))^2)
  rmse <- sqrt(mean(s2$residuals^2))  ## RMSE 2nd stage
  V0 <- vcov(s2)  ## biased vcov 2nd stage
  dof <- s2$df.residual  ## degrees of freedom 2nd stage
  ## calculate corrected RMSE
  rmse.c <- sqrt(sse/dof)
  ## calculate corrected vcov
  V <- (rmse.c/rmse)^2 * V0
  return(V)
}

So what I am looking for, is a function in which I can provide both the vcov matrix (the vcov2sls), and have robust and clustered standard errors. However it seems that they both pertain to the same vcov matrix option. So if I supply one, I already have to make sure the se's are clustered and robust.. So I guess I am essentially asking how I can make the vcov2sls function have robust and clustered errors. Obviously any other solution leading to the same practical outcome would be great as well.

However I want to use jay.sf's function, when the first stage is an lm, the clustering takes part in the summary (source), for example:

first_stage_ols <- lm(y~x, data=DT)
summary(first_stage_ols, robust=T)

Is there either, a way to correct the standard errors from within the lm function , or (replaced them in the result), or adapt the vcov2sls function to also account for robust/clustered standard errors?

EDIT: I know that also lmtest:coeftest exists, but I want to able to use weights. See this link. I am having trouble figuring out if this is possible in lmtest:coeftest .

EDIT I - Trying testers solution

So I tried testers answer in two situations. Firstly where I move from a tobit to a lm, and the other vice versa.

# Tobit -> LM

library(lmtest)
library(sandwich)
## run with lm ##
s1.tobit <- AER::tobit(taxrate ~ votewon + industry + size + urbanisation + vote, data=DF)
# cluster and adjust ses
s1.robust <-  vcovCL(s1.tobit, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))

s1.summary <- summary(s1.tobit)
s1.summary$coefficients[, 2] <- s1.robust.se

yhat <- fitted(s1.tobit)

s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

lmtest::coeftest(s2.lm, vcov.=vcov2sls(s1.summary, s2.lm, DF))

# WORKS!

Vice versa:

# LM -> tobit

library(lmtest)
library(sandwich)
## run with lm ##
s1.lm <- lm(taxrate ~ votewon + industry + size + urbanisation + vote, data=DF)
# cluster and adjust ses
s1.robust <-  vcovCL(s1.lm, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))

s1.summary <- summary(s1.lm)
s1.summary$coefficients[, 2] <- s1.robust.se

yhat <- fitted(s1.lm)

s2.tobit <- AER::tobit(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

and then ????

# DOES NOT WORK, NO WAY TO ADD THE VCOV TO TOBIT

END OF EDIT

EDIT II - Testing the p-values between the lm_robust and manual

When using lm_robust the result of the first stage is as follows:

                Estimate Std. Error    t value    Pr(>|t|)   CI Lower   CI Upper       DF
(Intercept)   25.3890287  2.1726518 11.6857327 0.009184928  15.393996 35.3840612 1.870781
votewon       -0.9900966  2.1099738 -0.4692459 0.687605609 -10.636404  8.6562105 1.882014
industryE     -0.7564888  0.3710393 -2.0388372 0.184868777  -2.434709  0.9217314 1.901678
industryF     -2.6639323  0.3058024 -8.7112866 0.013649538  -4.002800 -1.3250647 1.964416
size          -0.5291956  0.5523497 -0.9580807 0.443894805  -3.036862  1.9784705 1.894753
urbanisationB -1.5851495  2.2454251 -0.7059463 0.554845739 -11.464414  8.2941148 1.954657
urbanisationC -2.7234541  0.3573827 -7.6205532 0.020124544  -4.365749 -1.0811587 1.872744
vote           3.1749142  2.4600297  1.2906000 0.341874112  -9.076839 15.4266675 1.740353

However when doing the manual calculations the p-values are very different:

s1.summary

Call:
lm(formula = taxrate ~ votewon + industry + size + urbanisation + 
    vote, data = DF)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2747  -4.3212  -0.6788   4.3677  10.7369 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    25.3890     2.1506  13.845   <2e-16 ***
votewon        -0.9901     2.1742  -0.676   0.5007    
industryE      -0.7565     0.3492  -0.557   0.5792    
industryF      -2.6639     0.2877  -1.855   0.0668 .  
size           -0.5292     0.5109  -1.250   0.2145    
urbanisationB  -1.5851     2.2311  -1.098   0.2753    
urbanisationC  -2.7235     0.3474  -1.704   0.0918 .  
vote            3.1749     2.4840   2.105   0.0380 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.623 on 92 degrees of freedom
Multiple R-squared:  0.1054,    Adjusted R-squared:  0.03734 
F-statistic: 1.549 on 7 and 92 DF,  p-value: 0.1609

And this is only for the first stage.

Example Data

DF <- structure(list(country = c("C", "C", "C", "C", "J", "J", "B", 
"B", "F", "F", "E", "E", "D", "D", "F", "F", "I", "I", "J", "J", 
"E", "E", "C", "C", "I", "I", "I", "I", "I", "I", "C", "C", "H", 
"H", "J", "J", "G", "G", "J", "J", "I", "I", "C", "C", "D", "D", 
"A", "A", "G", "G", "E", "E", "J", "J", "G", "G", "I", "I", "I", 
"I", "J", "J", "G", "G", "E", "E", "G", "G", "E", "E", "F", "F", 
"I", "I", "B", "B", "E", "E", "H", "H", "B", "B", "A", "A", "I", 
"I", "I", "I", "F", "F", "E", "E", "I", "I", "J", "J", "D", "D", 
"F", "F"), year = c(2005, 2010, 2010, 2005, 2005, 2010, 2010, 
2005, 2010, 2005, 2005, 2010, 2010, 2005, 2005, 2010, 2005, 2010, 
2005, 2010, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 2010, 2010, 
2005, 2010, 2005, 2005, 2010, 2010, 2005, 2005, 2010, 2005, 2010, 
2005, 2010, 2005, 2010, 2010, 2005, 2005, 2010, 2010, 2005, 2010, 
2005, 2010, 2005, 2010, 2005, 2010, 2005, 2010, 2005, 2010, 2005, 
2010, 2005, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 2010, 2005, 
2010, 2005, 2010, 2005, 2010, 2005, 2010, 2010, 2005, 2005, 2010, 
2005, 2010, 2010, 2005, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 
2010, 2010, 2005, 2010, 2005), sales = c(15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9), industry = c("D", 
"D", "E", "E", "F", "F", "F", "F", "D", "D", "E", "E", "D", "D", 
"E", "E", "F", "F", "F", "F", "D", "D", "F", "F", "E", "E", "D", 
"D", "D", "D", "E", "E", "F", "F", "D", "D", "E", "E", "E", "E", 
"D", "D", "E", "E", "D", "D", "D", "D", "E", "E", "D", "D", "F", 
"F", "D", "D", "D", "D", "E", "E", "D", "D", "E", "E", "D", "D", 
"D", "D", "D", "D", "F", "F", "F", "F", "E", "E", "D", "D", "E", 
"E", "F", "F", "E", "E", "F", "F", "E", "E", "F", "F", "D", "D", 
"D", "D", "D", "D", "D", "D", "F", "F"), urbanisation = c("B", 
"B", "A", "A", "B", "B", "A", "A", "C", "C", "C", "C", "A", "A", 
"B", "B", "C", "C", "A", "A", "C", "C", "B", "B", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "C", "C", "B", "B", "B", "B", 
"B", "B", "C", "C", "A", "A", "B", "B", "B", "B", "A", "A", "B", 
"B", "A", "A", "A", "A", "B", "B", "C", "C", "A", "A", "C", "C", 
"A", "A", "B", "B", "A", "A", "B", "B", "B", "B", "B", "B", "C", 
"C", "A", "A", "A", "A", "A", "A", "A", "A", "C", "C", "A", "A", 
"B", "B", "A", "A", "B", "B", "B", "B"), size = c(1, 1, 5, 5, 
5, 5, 1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2, 5, 5, 1, 1, 1, 1, 5, 
5, 5, 5, 4, 4, 5, 5, 5, 5, 4, 4, 2, 2, 5, 5, 1, 1, 1, 1, 2, 2, 
1, 1, 2, 2, 5, 5, 1, 1, 3, 3, 2, 2, 2, 2, 5, 5, 4, 4, 1, 1, 5, 
5, 2, 2, 5, 5, 2, 2, 2, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 
5, 5, 3, 3, 2, 2, 3, 3, 1, 1, 5, 5), base_rate = c(14L, 14L, 
14L, 14L, 19L, 19L, 30L, 30L, 20L, 20L, 29L, 29L, 20L, 20L, 20L, 
20L, 24L, 24L, 19L, 19L, 29L, 29L, 14L, 14L, 24L, 24L, 24L, 24L, 
24L, 24L, 14L, 14L, 17L, 17L, 19L, 19L, 33L, 33L, 19L, 19L, 24L, 
24L, 14L, 14L, 20L, 20L, 23L, 23L, 33L, 33L, 29L, 29L, 19L, 19L, 
33L, 33L, 24L, 24L, 24L, 24L, 19L, 19L, 33L, 33L, 29L, 29L, 33L, 
33L, 29L, 29L, 20L, 20L, 24L, 24L, 30L, 30L, 29L, 29L, 17L, 17L, 
30L, 30L, 23L, 23L, 24L, 24L, 24L, 24L, 20L, 20L, 29L, 29L, 24L, 
24L, 19L, 19L, 20L, 20L, 20L, 20L), taxrate = c(12L, 14L, 14L, 
12L, 21L, 18L, 30L, 30L, 20L, 20L, 29L, 30L, 20L, 20L, 20L, 20L, 
24L, 24L, 21L, 18L, 30L, 29L, 14L, 12L, 24L, 24L, 24L, 24L, 24L, 
24L, 14L, 12L, 18L, 19L, 18L, 21L, 33L, 32L, 21L, 18L, 24L, 24L, 
12L, 14L, 20L, 20L, 22L, 25L, 32L, 33L, 30L, 29L, 18L, 21L, 32L, 
33L, 24L, 24L, 24L, 24L, 18L, 21L, 32L, 33L, 30L, 29L, 32L, 33L, 
29L, 30L, 20L, 20L, 24L, 24L, 30L, 30L, 29L, 30L, 18L, 19L, 30L, 
30L, 22L, 25L, 24L, 24L, 24L, 24L, 20L, 20L, 30L, 29L, 24L, 24L, 
21L, 18L, 20L, 20L, 20L, 20L), vote = c(0, 0, 0, 0, 1, 1, 1, 
0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 
1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 
1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 
1, 0, 1, 1, 1, 1, 0, 1, 1), votewon = c(0, 0, 0, 0, 1, 0, 1, 
0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 
0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 
1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 
0, 0, 1, 1, 0, 1, 0, 1, 1)), class = "data.frame", row.names = c(NA, 
-100L))

## convert variables to factors beforehand
DF[c(1, 2, 4, 5, 6, 9, 10)] <- lapply(DF[c(1, 2, 4, 5, 6, 9, 10)], factor)

s1.tobit <- AER::tobit(taxrate ~ votewon + industry + size + urbanisation + vote,
                  left=12, right=33, data=DF)
yhat <- fitted(s1.tobit)
s2.tobit <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

lmtest::coeftest(s2.tobit, vcov.=vcov2sls(s1.tobit, s2.tobit, DF))

tester tester · Accepted Answer · 2021-01-10T20:53:57

EDIT: This way you can run a lm in the first stage, adjust the model's SEs and use it to overwrite the SEs produced by summary(lm). Then, you can estimate the second stage and use your custom function with coeftest(). Not sure about the clustering but this should roughly work, no?

library(lmtest)
library(sandwich)
## run with lm ##
s1.lm <- lm(taxrate ~ votewon + industry + size + urbanisation + vote,
          data=DF)
# cluster and adjust ses
s1.robust <-  vcovCL(s1.lm, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))

s1.summary <- summary(s1.lm)
s1.summary$coefficients[, 2] <- s1.robust.se

yhat <- fitted(s1.lm)

s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

lmtest::coeftest(s2.lm, vcov.=vcov2sls(s1.summary, s2.lm, DF))

Take a look at the estimatr package, especially lm_robust. I'm not 100% sure what you intend to do, but you can get robust standard errors by running this:

library(estimatr)
lm1 <-
  lm_robust(
    mpg ~ hp,
    data = mtcars,
    clusters = cyl,
    weights = wt,
    se_type = "stata" # alternatives: CR0, CR1
  )
summary(lm1)

Using your example from above, is this roughly what you are looking for? Please note, that lm_robust already adjusts the SEs.

s1.lm <- lm_robust(taxrate ~ votewon + industry + size + urbanisation + vote,
                   data = DF)
yhat <- fitted(s1.lm)
s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data = DF)

> lmtest::coeftest(s2.lm, vcov. = vcov2sls(s1.lm, s2.lm, DF))

t test of coefficients:

               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -18.45116   62.14257 -0.2969   0.7672
yhat            1.57784    2.72176  0.5797   0.5636
industryE       0.98174    5.10677  0.1922   0.8480
industryF       2.09036    7.25181  0.2883   0.7738
size2          -8.85327   12.43454 -0.7120   0.4783
size3          -5.74011    7.14973 -0.8028   0.4242
size4         -10.79326   13.14534 -0.8211   0.4138
size5          -3.38280    5.45691 -0.6199   0.5369
urbanisationB  -1.74588    6.34107 -0.2753   0.7837
urbanisationC  -2.00370    6.48533 -0.3090   0.7581
vote1          -1.01661    6.49424 -0.1565   0.8760

> lmtest::coeftest(s2.lm)

t test of coefficients:

               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -18.45116   46.39576 -0.3977   0.6918
yhat            1.57784    2.03207  0.7765   0.4395
industryE       0.98174    3.81273  0.2575   0.7974
industryF       2.09036    5.41421  0.3861   0.7004
size2          -8.85327    9.28365 -0.9536   0.3428
size3          -5.74011    5.33801 -1.0753   0.2851
size4         -10.79326    9.81434 -1.0997   0.2744
size5          -3.38280    4.07414 -0.8303   0.4086
urbanisationB  -1.74588    4.73425 -0.3688   0.7132
urbanisationC  -2.00370    4.84196 -0.4138   0.6800
vote1          -1.01661    4.84861 -0.2097   0.8344

I'll update my answer depending on your comments.

Correcting for robust/clustered standard errors within the lm function or replacing the results

EDIT I - Trying testers solution

EDIT II - Testing the p-values between the lm_robust and manual

Example Data

1 Answers