
I am trying to simulate the sensitivity of matching versus regression (OLS), but I am doing something wrong somewhere, because I cannot recover the true treatment effect using matching.

I am generating 3 variables: x, a background characteristic; d, the binary treatment variable; and y, the outcome. d is associated with x. The idea of matching is that, once you condition on x, the treatment assignment process is as good as random. In the regression world, x is simply a control variable. I want to test how regression performs when there is a region of non-common support in the data (no treated units above or below certain values of x).

library(tidyverse)
library(Matching)
library(MatchIt)

N = 1000
# generate a normally distributed background variable #
x = rnorm(N, 0, 5)

This is how I generate an association between x and d (binary).

# generate treatment d associated with x, with different probabilities above and below a threshold #
d = ifelse(x > 0.7, rbinom(0.7 * N, 1, 0.6) , rbinom( (1 - 0.7) * N, 1, 0.3) )
# above 0.7 the probability of receiving treatment is 0.6, below it is 0.3 #

Seems correct to me, but if you have a better way to do it, let me know.
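
A possibly cleaner way to encode the same assignment mechanism (just a sketch of an alternative, with d_alt as an illustrative name) is a single rbinom call for all N units, with the success probability switched by ifelse, so nothing relies on ifelse recycling two shorter vectors:

# draw all N treatment indicators at once; the probability of treatment depends on x #
d_alt = rbinom(N, 1, ifelse(x > 0.7, 0.6, 0.3))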

# adding a bit more randomness #
d[ sample(length(d), 100) ] <- rbinom(100, 1, 0.5)

# also adding cut-off points for the treated: no treated units outside (-10, 10) #
d[x < -10] <- 0
d[x > 10] <- 0

I generate the effect of d using ifelse; this seems correct to me, but I might be wrong.

# generate outcome y, with a polynomial relationship to x, a treatment effect of 15, and noise with sd = 10 #
y = x*1 + x^2 + rnorm(N, ifelse(d == 1, 15, 0), 10)

#
df = cbind(x,d,y) %>% as.data.frame()
# check out the "common support"
df %>% ggplot(aes(x, y, colour = factor(d) )) + geom_point()
#

The plot shows the way I want to model the 3 relationships. Note the cut-offs at -10 and 10 for the treated.

[scatterplot of y against x, coloured by treatment status d]
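
As a numerical counterpart to the plot (just a sanity check), the range of x within each treatment group should show treated units only between -10 and 10:

# range of x by treatment status: treated units should only appear in (-10, 10) #
df %>% group_by(d) %>% summarise(min_x = min(x), max_x = max(x))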

Now, when I estimate the effect of d on y with OLS, the model with x omitted and the misspecified model give me, as expected, an incorrect estimate of the effect of d.

# omitted x #
lm(y ~ d, df) %>% summary()
# misspecification #
lm(y ~ d + x, df) %>% summary()

The correct specification, on the other hand, gets me 15 (the true effect of d).

# true model #
lm(y ~ d + poly(x,2), df) %>% summary()
# we correctly retrieve 15 #
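
For a quick side-by-side comparison, one can also pull out just the coefficient on d from each specification (purely a convenience; it reads off the same numbers as the summaries above):

# coefficient on d under each specification #
c(omitted = coef(lm(y ~ d, df))["d"],
  misspec = coef(lm(y ~ d + x, df))["d"],
  truemod = coef(lm(y ~ d + poly(x,2), df))["d"])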

Now my issue is understanding why I can't get to 15 (the true effect of d) with the matching packages.

Using the MatchIt package.

I tried Mahalanobis distance matching and propensity score matching, like this:

m1 = matchit(d ~ x, df, distance = 'mahalanobis', method = 'genetic')
m2a = matchit(d ~ x, df, distance = 'logit', method = 'genetic')
m2b = matchit(d ~ x + I(x^2), df, distance = 'logit', method = 'genetic')

Extracting the matched data:

mat1 = match.data(m1)
mat2a = match.data(m2a)
mat2b = match.data(m2b)

# OLS #
lm(y ~ d, mat1) %>% summary()
lm(y ~ d, mat2a) %>% summary()
lm(y ~ d, mat2b) %>% summary()

So here I do not retrieve the 15. Why? Am I misinterpreting the results? I was under the impression that when doing matching, you did not have to model polynomial terms and/or interactions. Is that incorrect?

lm(y ~ d + poly(x,2), mat1) %>% summary()
lm(y ~ d + poly(x,2), mat2a) %>% summary()
lm(y ~ d + poly(x,2), mat2b) %>% summary()

I do retrieve 15 if I include the poly(x,2) term here.
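
One diagnostic that speaks to the question about polynomial terms is the covariate balance summary that MatchIt provides; calling summary() on a matchit object reports balance on the matching covariates before and after matching (shown here only as a check, not as an answer):

# covariate balance before and after matching, for the Mahalanobis and quadratic-logit specifications #
summary(m1)
summary(m2b)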

Using the Matching package, I also get a completely different estimate:

# estimate the propensity score with a logit that includes x^2 #
gl = glm(d ~ x + I(x^2), df, family = binomial)
x1 = gl$fitted.values

# I thought the discrepancy could be because OLS only gives the ATE, so I ask Match() for the ATE as well #
m0 = Match(Y = y, Tr = d, X = x1, estimand = 'ATE')
# but no, the estimate is still far from 15 #
m0$est

Any clue?


1 Answer


An important output of the matching procedure is the set of weights it assigns to the control observations. The weights are computed such that, once they are applied, the distribution of the propensity score is similar in the treated and the control groups.

In your case, this means (starting from your DGP and keeping your notation):

lm(y ~ d, mat1, weights = weights) %>% summary()
lm(y ~ d, mat2a, weights = weights) %>% summary()
lm(y ~ d, mat2b, weights = weights) %>% summary()

And here we are: 15 is back (or actually 14.9)!
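
If you want to see what the weights are doing: match.data() returns them as a weights column, and the weighted regression of y on d alone is just a weighted difference in means, which you can reproduce by hand (a quick illustration using mat1):

# the weights column returned by match.data(), summarised by treatment group #
mat1 %>% group_by(d) %>% summarise(n = n(), mean_w = mean(weights))

# the weighted OLS of y on d is equivalent to this weighted difference in means #
with(mat1, weighted.mean(y[d == 1], weights[d == 1]) -
           weighted.mean(y[d == 0], weights[d == 0]))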