Simulate data for regression with standardized Y

Question

Based on this topic, I have created a function that returns a dataset with variables related to the outcome (y) by specific linear coefs.

simulate_data_regression <- function(sample=10, coefs=0, error=0){

  n_var <- length(coefs)
  X <- matrix(0, ncol=n_var, nrow=sample)

  beta <- as.matrix(coefs)

  for (i in 1:n_var){
    X[,i] <- scale(rnorm(sample, 0, 1))
  }

  y <- X %*% beta

  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }


  data = data.frame(X=X)
  names(data) <- paste0("V", 1:n_var)
  data$y <- as.vector(y)

  return(data)
}

data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
summary(data)
sd(data$V1)
sd(data$y)

It works great. However, I would need to have a standardized y (mean 0 and SD 1). But when I try to scale it, the coefficients change:

data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
data$y <- as.vector(scale(data$y))
coef(lm(y ~ ., data=data))

It is possible to do such thing? Thank you very much!

Edit

In other words, I would like the coefs that are specified to be standardized coefs (expressed in outcome's SD).

Scaling y a posteriori changes the coefs by 1/sd(y). However, I can't think of any way to change the betas before generating y, so that the betas return to their specified value after the scaling of y.

Edit 2: Failed attempt

I've tried running the function twice, first extracting sd(y) and scaling the coefficients with it, in the hope that those scaled coefficients will change to the specified ones once I'll scale y. But it doens't work, which is expected, as sd(y) changes when I change the coefs :'(

Here's the failed attempt:

simulate_data_regression <- function(sample=10, coefs=0, error=0, standardized=TRUE){

  stuff <- .simulate_data_regression(sample=sample, coefs=coefs, error=error)
  if(standardized == TRUE){
    y_sd <- sd(data$y)
    data <- .simulate_data_regression(sample=sample, coefs=y_sd*coefs, error=error, X=stuff$X)$data
    data$y <- as.vector(scale(data$y))
  } else{
    data <- stuff$data
  }
  return(data)
}


.simulate_data_regression <- function(sample=10, coefs=0, error=0, X=NULL, y=NULL){

  n_var <- length(coefs)

  if(is.null(X)){
    X <- matrix(0, ncol=n_var, nrow=sample)
    for (i in 1:n_var){
      X[,i] <- scale(rnorm(sample, 0, 1))
    }
  }

  beta <- as.matrix(coefs)
  y <- X %*% beta

  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }


  data = data.frame(X=X)
  names(data) <- paste0("V", 1:n_var)
  data$y <- as.vector(y)

  return(list(X=X, y=y, data=data))
}

If you want to get the same coefficients you need to call scale(y, scale = FALSE). The betas are invariant to location so you can center at will. But they are not invariant to scaling, if you scale x1 by a factor then beta1 will be multiplied by 1/factor1. — Rui Barradas
It's something, but the scaling of y is actually what I would need the most (so that the specified coefs are in fact standardized coefs)... Is there any way of generating this kind of data? — Dominique Makowski

Rui Barradas Rui Barradas · Accepted Answer · 2018-08-23T12:37:55

If you scale y the inference will be the same, only the p-values of the intercepts change, not the p-values of the coefficients.
In this example I have set error = 1.

set.seed(1234)    # Make the results reproducible
data <- simulate_data_regression(sample = 50, coefs = c(0.1, 0.8), error = 1)
data2 <- data
data2$y <- scale(data2$y)

fit <- lm(y ~ ., data)
fit2 <- lm(y ~ ., data2)

summary(fit)
summary(fit2)

As you can see the p-values of the coefficients are exactly the same though the coefficients themselves are different. You would expect that since you are scaling by the standard errors of the regressors and therefore the coefficients will be scaled by the inverses of those standard errors.

The version of your function below has an argument, which, that allows to specify which regressors to scale. Its default is all of them.

simulate_data_regression2 <- function(sample = 10, coefs = 0, error = 0, which = seq_along(coefs)){
  n_var <- length(coefs)
  X <- matrix(0, ncol=n_var, nrow=sample)
  beta <- as.matrix(coefs)
  for (i in 1:n_var){
    X[,i] <- rnorm(sample, 0, 1)
    if(i %in% which) X[, i] <- scale(X[, i])
  }
  y <- X %*% beta
  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }
  data = data.frame(X=X)
  names(data) <- paste0("V", 1:n_var)
  data$y <- as.vector(y)
  data
}

Now test the function.

set.seed(1234)    # Make the results reproducible
data <- simulate_data_regression2(sample=50, coefs=c(0.1, 0.8), error=1)

set.seed(1234)    # Reproduce the data generation process
data2 <- simulate_data_regression2(sample=50, coefs=c(0.1, 0.8), error=1, which = 2)

fit <- lm(y ~ ., data)
fit2 <- lm(y ~ ., data2)

As you can see the coefficients of V2 are equal.

coef(fit)
#(Intercept)          V1          V2 
# 0.01997809  0.19851020  0.96310013

coef(fit2)
#(Intercept)          V1          V2 
# 0.07040538  0.21130549  0.96310013

The p-values of the estimates of the coefficients V2 are also equal

summary(fit)
summary(fit2)

Simulate data for regression with standardized Y

Edit

Edit 2: Failed attempt

1 Answers