Based on this topic, I have created a function that returns a dataset whose variables are related to the outcome (y) by specified linear coefficients.
simulate_data_regression <- function(sample=10, coefs=0, error=0){
  n_var <- length(coefs)
  X <- matrix(0, ncol=n_var, nrow=sample)
  beta <- as.matrix(coefs)
  # Each predictor is drawn from N(0, 1), then standardized in-sample
  for (i in seq_len(n_var)){
    X[,i] <- scale(rnorm(sample, 0, 1))
  }
  # Outcome as a linear combination of the predictors
  y <- X %*% beta
  # Optional Gaussian noise with SD = error
  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }
  data <- data.frame(X=X)
  names(data) <- paste0("V", seq_len(n_var))
  data$y <- as.vector(y)
  return(data)
}
data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
summary(data)
sd(data$V1)
sd(data$y)
It works great. However, I need y to be standardized (mean 0 and SD 1). But when I scale it, the coefficients change:
data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
data$y <- as.vector(scale(data$y))
coef(lm(y ~ ., data=data))
Is it possible to do such a thing? Thank you very much!
Edit
In other words, I would like the coefs that are specified to be standardized coefs (expressed in outcome's SD).
Scaling y a posteriori changes the coefs by 1/sd(y). However, I can't think of any way to change the betas before generating y, so that the betas return to their specified value after the scaling of y.
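The 1/sd(y) effect can be checked directly. The following is a minimal sketch with error = 0; the seed and the object names (raw, std) are illustrative only and not part of the function above.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 50
# Two standardized predictors, built the same way as in the function above
X <- cbind(V1 = as.vector(scale(rnorm(n))),
           V2 = as.vector(scale(rnorm(n))))
beta <- c(0.1, 0.8)
y <- as.vector(X %*% beta)
d <- data.frame(X, y = y)

# Slopes on the raw outcome (intercept dropped)
raw <- coef(lm(y ~ ., data = d))[-1]

# Slopes after standardizing the outcome
d$y <- as.vector(scale(y))
std <- coef(lm(y ~ ., data = d))[-1]

# The standardized slopes should equal the raw ones divided by sd(y)
all.equal(unname(std), unname(raw) / sd(y))
```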
Edit 2: Failed attempt
I've tried running the function twice, first extracting sd(y) and scaling the coefficients with it, in the hope that those scaled coefficients would change to the specified ones once I scale y. But it doesn't work, which is expected, as sd(y) changes when I change the coefs :'(
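The circularity can be illustrated with a small sketch, assuming error = 0: multiplying the coefs by any constant k multiplies sd(y) by the same k, so the coefficients recovered after scaling y do not depend on k at all. The helper std_coefs is hypothetical and exists only for this illustration.

```r
set.seed(2)  # arbitrary seed, only for reproducibility
n <- 50
X <- cbind(V1 = as.vector(scale(rnorm(n))),
           V2 = as.vector(scale(rnorm(n))))
beta <- c(0.1, 0.8)

# Hypothetical helper: generate y from k * beta (no noise), standardize y,
# and return the fitted slopes
std_coefs <- function(k) {
  y <- as.vector(scale(X %*% (k * beta)))
  coef(lm(y ~ X))[-1]
}

# One column per multiplier; with error = 0 all columns should be identical
res <- sapply(c(0.5, 1, 2, 10), std_coefs)
res
```

With a nonzero error of fixed SD the columns would differ, but the noiseless case is the one used in the examples above.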
Here's the failed attempt:
simulate_data_regression <- function(sample=10, coefs=0, error=0, standardized=TRUE){
  # First pass: generate X and an unscaled y
  stuff <- .simulate_data_regression(sample=sample, coefs=coefs, error=error)
  if(standardized == TRUE){
    # Use the y from the first pass (not a global `data` object)
    y_sd <- sd(stuff$data$y)
    # Second pass: reuse X with coefs multiplied by sd(y), then scale y
    data <- .simulate_data_regression(sample=sample, coefs=y_sd*coefs, error=error, X=stuff$X)$data
    data$y <- as.vector(scale(data$y))
  } else{
    data <- stuff$data
  }
  return(data)
}
.simulate_data_regression <- function(sample=10, coefs=0, error=0, X=NULL, y=NULL){
  n_var <- length(coefs)
  if(is.null(X)){
    X <- matrix(0, ncol=n_var, nrow=sample)
    for (i in seq_len(n_var)){
      X[,i] <- scale(rnorm(sample, 0, 1))
    }
  }
  beta <- as.matrix(coefs)
  y <- X %*% beta
  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }
  data <- data.frame(X=X)
  names(data) <- paste0("V", seq_len(n_var))
  data$y <- as.vector(y)
  return(list(X=X, y=y, data=data))
}
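One direction that might work instead, offered only as a sketch and not as a confirmed solution: rather than rescaling the coefs, choose the noise so that y ends up with SD 1. If the noise is made exactly orthogonal to X and its variance is set to 1 - var(X %*% coefs), then y has mean 0 and SD 1 in-sample and the fitted slopes equal the specified coefs. The function name simulate_standardized is hypothetical, the error level is no longer a free parameter, and the approach only applies when the coefs imply a signal variance below 1.

```r
# Hypothetical sketch: standardized y with the specified standardized coefs
simulate_standardized <- function(sample = 50, coefs = c(0.1, 0.8)) {
  n_var <- length(coefs)
  # Predictors standardized column by column
  X <- scale(matrix(rnorm(sample * n_var), ncol = n_var))
  signal <- as.vector(X %*% coefs)

  # Variance left over for the noise so that var(y) is 1
  resid_var <- 1 - var(signal)
  if (resid_var < 0) {
    stop("coefs imply var(y) > 1, so y cannot be standardized")
  }

  # Residuals of random noise regressed on X are orthogonal to X
  # and to the intercept, so adding them does not shift the fitted slopes
  e <- as.vector(resid(lm(rnorm(sample) ~ X)))
  e <- e / sd(e) * sqrt(resid_var)

  data <- data.frame(X)
  names(data) <- paste0("V", seq_len(n_var))
  data$y <- signal + e
  data
}

set.seed(3)  # arbitrary seed, only for reproducibility
d <- simulate_standardized(sample = 50, coefs = c(0.1, 0.8))
sd(d$y)
coef(lm(y ~ ., data = d))
```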
scale(y, scale = FALSE). The betas are invariant to location so you can center at will. But they are not invariant to scaling: if you scale x1 by a factor then beta1 will be multiplied by 1/factor. - Rui Barradas
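Both claims in the comment can be checked with a short sketch. The data, the intercept of 3, the noise SD and the multiplier k are all made up for illustration.

```r
set.seed(4)  # arbitrary seed, only for reproducibility
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 3 + 0.1 * x1 + 0.8 * x2 + rnorm(n, 0, 0.5)

b <- coef(lm(y ~ x1 + x2))

# Centering y: only the intercept moves, the slopes stay the same
y_c <- as.vector(scale(y, scale = FALSE))
b_centered <- coef(lm(y_c ~ x1 + x2))
all.equal(unname(b_centered[-1]), unname(b[-1]))

# Scaling x1 by k: its slope is multiplied by 1/k
k <- 4
x1s <- x1 * k
b_scaled <- coef(lm(y ~ x1s + x2))
all.equal(unname(b_scaled["x1s"]), unname(b["x1"]) / k)
```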