I'd like to simulate data for a multiple linear regression with four predictors where I am free to specify
- the overall explained variance of the model
- the magnitude of all standardized regression coefficients
- the degree to which the predictor variables are correlated with each other
I arrived at a solution that fulfills the first two points, but it rests on the assumption that the predictor variables are uncorrelated with each other (see code below). To obtain standardized regression coefficients, I sample from population distributions with mean = 0 and variance = 1.
# Specify the population variance/covariance matrix of the four predictor
# variables to sample from
sigma.1 <- matrix(c(1, 0, 0, 0,
                    0, 1, 0, 0,
                    0, 0, 1, 0,
                    0, 0, 0, 1), nrow = 4, ncol = 4)
# Specify the population means of the four predictor variables to sample from
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 50000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
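# (Aside: mvrnorm(n.obs, mu.1, sigma.1, empirical = TRUE) would make the sample
# mean and covariance match mu.1 and sigma.1 exactly, so the large n.obs would
# not be needed to suppress sampling error.)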
# Add error variable based on desired r2
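# Derivation: with uncorrelated, unit-variance predictors,
# Var(y_hat) = sum(beta^2) and R^2 = Var(y_hat) / (Var(y_hat) + Var(epsilon)),
# so Var(epsilon) = sum(beta^2) * (1 - R^2) / R^2.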
var.epsilon <- sum(beta^2) * ((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
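# Optional sanity check: Var(y) should be close to sum(beta^2) + var.epsilon
# = 0.375 + 0.875 = 1.25, which implies R^2 = 0.375 / 1.25 = 0.30 as desired.
var(sample1$y)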
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0564 -0.6310 -0.0048  0.6339  3.7119 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.496063   0.004175  118.82   <2e-16 ***
V1          0.402588   0.004189   96.11   <2e-16 ***
V2          0.291636   0.004178   69.81   <2e-16 ***
V3          0.247347   0.004171   59.30   <2e-16 ***
V4          0.253810   0.004175   60.79   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9335 on 49995 degrees of freedom
Multiple R-squared:  0.299,  Adjusted R-squared:  0.299 
F-statistic:  5332 on 4 and 49995 DF,  p-value: < 2.2e-16
Problem: If the predictor variables are correlated, i.e. if their variance/covariance matrix is specified with nonzero off-diagonal elements, the resulting R² and regression coefficients differ substantially from the values I specified, e.g. when using
sigma.1 <- matrix(c(1,    0.25, 0.25, 0.25,
                    0.25, 1,    0.25, 0.25,
                    0.25, 0.25, 1,    0.25,
                    0.25, 0.25, 0.25, 1), nrow = 4, ncol = 4)
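I suspect the cause is that the variance of the linear predictor is no longer sum(beta^2) once the predictors are correlated; it becomes t(beta) %*% sigma.1 %*% beta. A quick check with the values above (my own diagnostic, not a fix):

# Variance of the linear predictor: independent vs. correlated predictors
sum(beta^2)                               # 0.375 -- what var.epsilon assumes
as.numeric(t(beta) %*% sigma.1 %*% beta)  # 0.64125 with the 0.25 correlations
# var.epsilon stays sized for 0.375, so the realized R^2 (about 0.42)
# overshoots the 0.30 target.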
Any suggestions? Thanks!