I would like to simulate data for a logistic regression where I can specify its explained variance beforehand. Have a look at the code below. I simulate four independent variables and specify that each logit coefficient should be of size log(2)=0.69. This works nicely, the explained variance (I report Cox & Snell's r2) is 0.34.
However, I need to specify the regression coefficients in such a way that a pre-specified r2 will result from the regression. So if I would like to produce an r2 of let's say exactly 0.1. How do the coefficients need to be specified? I am kind of struggling with this..
# Create independent variables
sigma.1 <- matrix(c(1,0.25,0.25,0.25,
0.25,1,0.25,0.25,
0.25,0.25,1,0.25,
0.25,0.25,0.25,1),nrow=4,ncol=4)
mu.1 <- rep(0,4)
n.obs <- 500000
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Create latent continuous response variable
sample1$ystar <- 0 + log(2)*sample1$V1 + log(2)*sample1$V2 + log(2)*sample1$V3 + log(2)*sample1$V4
# Construct binary response variable
sample1$prob <- exp(sample1$ystar) / (1 + exp(sample1$ystar))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)
# Logistic regression
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)
The output is:
Call:
glm(formula = y ~ V1 + V2 + V3 + V4, family = binomial, data = sample1)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7536 -0.7795 -0.0755 0.7813 3.3382
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.002098 0.003544 -0.592 0.554
V1 0.691034 0.004089 169.014 <2e-16 ***
V2 0.694052 0.004088 169.776 <2e-16 ***
V3 0.693222 0.004079 169.940 <2e-16 ***
V4 0.699091 0.004081 171.310 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 693146 on 499999 degrees of freedom
Residual deviance: 482506 on 499995 degrees of freedom
AIC: 482516
Number of Fisher Scoring iterations: 5
And Cox and Snell's r2 gives:
library(pscl)
pR2(logreg)["r2ML"]
> pR2(logreg)["r2ML"]
r2ML
0.3436523
independent
variables? – RolandASc