I'm new to R and am trying to calculate 95% confidence intervals for the R-squared values and residual standard errors of linear models. I formed these models by using the bootstrap method to resample the response variable, and then created 999 linear models by regressing the 999 bootstrapped response variables on the original explanatory variable.

First of all, I am not sure if I should be calculating the 95% CI for R-squared and residual standard error for the ORIGINAL linear model (without the bootstrap data), because that doesn't make sense - the R-squared value is 100% exact for that linear model, and it doesn't make sense to calculate a CI for it.

Is that correct?

More importantly, I'm not sure how to calculate the CIs for the R-squared values and residual standard errors of the 999 linear models I've created from bootstrapping.

I have never heard of resampling the response variable and using the original explanatory variable. Are you sure that is what you want to do? - Seth
Are you supposed to resample BOTH the response and explanatory variables? - user2303557
In the normal bootstrap procedure you create a new dataframe by resampling entire rows from the original dataframe. - Seth
I think that is what I am doing. What I have is a table with two columns. One is tip percentage, one is total bill. I'm trying to study whether we can use the total bill (explanatory variable) to predict the tip percentage (response variable). Right now, I am bootstrapping by sampling the tip percentage from itself to generate 999 new samples of the tip percentage. - user2303557

1 Answer

You can definitely use the boot package to do this. But because I may be confused about what you want, I'll go step by step.

First, I make up some fake data:

n=10
x=rnorm(n)
realerror=rnorm(n,0,.9)
beta=3
y=beta*x+realerror

Then I make an empty place to catch the statistics I am interested in:

rsquared=NA
sse=NA

Then I make a for loop that resamples the data, runs a regression, and collects the two statistics on each iteration:

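for(i in 1:999)
{
  # create a vector of row indices to resample the data row-wise with replacement
  use=sample(1:n,replace=TRUE)

  # fit the regression on the resampled rows and keep its summary
  lm1=summary(lm(y[use]~x[use]))

  # R-squared of this bootstrap fit
  rsquared[i]=lm1$r.squared

  # residual sum of squares of this bootstrap fit
  sse[i]=sum(lm1$residuals^2)
}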
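The question asks for the residual standard error, while the loop above collects the residual sum of squares. As a rough sketch of how the same loop could record the residual standard error too, here it is wrapped in a hypothetical helper, boot_lm_stats() (my own name, not from the original answer), with the fake data regenerated under an arbitrary seed:

```r
# Hypothetical helper: row-wise bootstrap of a simple linear regression.
# Returns one row per replicate with R-squared and residual standard error.
boot_lm_stats <- function(x, y, B = 999) {
  n <- length(y)
  out <- data.frame(rsquared = numeric(B), sigma = numeric(B))
  for (i in seq_len(B)) {
    use <- sample.int(n, replace = TRUE)   # resample whole (x, y) rows
    fit <- summary(lm(y[use] ~ x[use]))
    out$rsquared[i] <- fit$r.squared
    out$sigma[i] <- fit$sigma              # residual standard error
  }
  out
}

set.seed(1)                                # arbitrary seed, only for reproducibility
n <- 10
x <- rnorm(n)
y <- 3 * x + rnorm(n, 0, 0.9)
stats <- boot_lm_stats(x, y)
```

Pre-allocating the result avoids growing the vectors inside the loop. With the result in hand, quantile(stats$sigma, c(0.025, 0.975)) should give approximately the same percentile limits as sorting and indexing by hand.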

Now I want the confidence intervals, so I sort each vector of statistics and report the values at positions 0.025*(B+1) and 0.975*(B+1), where B=999 is the number of bootstrap replicates. First, sort the statistics:

sse=sort(sse)
rsquared=sort(rsquared)

With 999 replicates, the 25th sorted value is the lower 95% confidence limit and the 975th sorted value is the upper limit:

> sse[c(25,975)]
[1]  2.758037 18.027106
> rsquared[c(25,975)]
[1] 0.5613399 0.9795167
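For comparison, here is a minimal sketch of the same percentile intervals using the boot package mentioned at the start. It assumes boot is installed (it normally ships with R as a recommended package). The data frame dat, the statistic function lm_stats, and the seed are illustrative choices of mine, not part of the original answer:

```r
library(boot)   # assumed to be installed; a recommended package bundled with R

set.seed(1)     # arbitrary seed, only for reproducibility
n <- 10
dat <- data.frame(x = rnorm(n))
dat$y <- 3 * dat$x + rnorm(n, 0, 0.9)

# Statistic function in the form boot() expects: (data, indices) -> numeric vector
lm_stats <- function(data, idx) {
  fit <- summary(lm(y ~ x, data = data[idx, ]))
  c(rsquared = fit$r.squared, sigma = fit$sigma)
}

# 999 row-wise bootstrap replicates of both statistics
b <- boot(dat, statistic = lm_stats, R = 999)

# Percentile intervals; index selects the element of the statistic vector
ci_rsq <- boot.ci(b, conf = 0.95, type = "perc", index = 1)
ci_sigma <- boot.ci(b, conf = 0.95, type = "perc", index = 2)

ci_rsq$percent[4:5]     # lower and upper limits for R-squared
ci_sigma$percent[4:5]   # lower and upper limits for residual standard error
```

Because boot() resamples whole rows of the data frame, each replicate keeps every tip percentage paired with its own total bill, which is the row-wise scheme described in the comments.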