When simulating multivariate data for regression, how can I set the R-squared (example code included)?

Question

I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.

The variables will be regression model will produce coefficients like this: Y = 5 + 3(X1) - 2(X2)

I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?

n <- 200 
set.seed(101) 
sd.value <- 1

X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)

simdata <- data.frame(X1, X2, Y)

summary(lm(Y ~ X1 + X2, data=simdata))

Adrian Adrian · Accepted Answer · 2013-09-30T15:01:03

Take a look at this code, it should be close enough to what you want:

simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
    stopifnot(length(beta) == 3)
    df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs))  # x1 and x2 are independent
    var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
    stopifnot(var.epsilon > 0)
    df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
    df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
    return(df)
}
get.R.sq <- function(desired) {
    model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
    return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)

Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where the var[Xs]=1 by assumption. You can then solve for var[eps] as a function of R-squared.

When simulating multivariate data for regression, how can I set the R-squared (example code included)?

4 Answers