bootstrapped confidence intervals with dplyr

4

votes

Looking at the answer here How can I estimate bootstrapped intervals? This question was asked on the ggplot2 list as well.

library(dplyr)
mtcars %>%
group_by(vs) %>%
summarise(mean.mpg = mean(mpg, na.rm = TRUE),
    sd.mpg = sd(mpg, na.rm = TRUE),
    n.mpg = n()) %>%
mutate(se.mpg = sd.mpg / sqrt(n.mpg),
 lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
 upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)

r dplyrconfidence-interval

7

votes

The Hmisc package has a function smean.cl.boot to compute simple bootstrap confidence intervals easily. The hardest part (IMO) is incorporating the multiple outputs of this result (the function returns a 3-element numeric vector) into a dplyr workflow (see dplyr::mutate to add multiple values)

library(Hmisc)  ## optional if using Hmisc:: below
library(dplyr)
mtcars %>%
  group_by(vs) %>%
  do(data.frame(rbind(Hmisc::smean.cl.boot(.$mpg))))

The new columns are labeled just Mean, Lower, Upper, but an additional setNames call would fix that ...

If doing a lot of this,

bootf <- function(x,var="mpg") {
    newstuff <- rbind(Hmisc::smean.cl.boot(x[[var]])) %>%
         data.frame %>%
         setNames(paste(var,c("mean","lwr","upr"),sep="_"))
    return(newstuff)
}
mtcars %>% group_by(vs) %>% do(bootf(.))
mtcars %>% group_by(cyl) %>% do(bootf(.))

4

votes

ORIGINAL ANSWER: Bootstrapping a single column

The code below includes a simple bootstrapping function plus some additional code to return an informative data frame:

my_boot = function(x, times=1000) {

   # Get column name from input object
   var = deparse(substitute(x))
   var = gsub("^\\.\\$","", var)

  # Bootstrap 95% CI
  cis = quantile(replicate(times, mean(sample(x, replace=TRUE))), probs=c(0.025,0.975))

  # Return data frame of results
  data.frame(var, n=length(x), mean=mean(x), lower.ci=cis[1], upper.ci=cis[2])
}

mtcars %>%
  group_by(vs) %>%
  do(my_boot(.$mpg))

     vs    var     n     mean lower.ci upper.ci
  <dbl> <fctr> <int>    <dbl>    <dbl>    <dbl>
1     0    mpg    18 16.61667 15.14972 18.06139
2     1    mpg    14 24.55714 22.36357 26.80750

UPDATE: Bootstrapping any selection of columns

Based on your comments, here is an updated method to get bootsrapped confidence intervals for any selection of columns:

library(reshape2)
library(tidyr)

my_boot = function(x, times=1000) {

  # Bootstrap 95% CI
  cis = quantile(replicate(times, mean(sample(x, replace=TRUE))), probs=c(0.025,0.975))

  # Return results as a data frame
  data.frame(mean=mean(x), lower.ci=cis[1], upper.ci=cis[2])
}

mtcars %>%
  group_by(vs) %>%
  do(as.data.frame(apply(., 2, my_boot))) %>% 
  melt(id.var="vs") %>%
  separate(variable, sep="\\.", extra="merge", into=c("col","stat")) %>%
  dcast(vs + col ~ stat, value.var="value")

   vs  col    lower.ci        mean    upper.ci
1   0   am   0.1111111   0.3333333   0.5555556
2   0 carb   3.0000000   3.6111111   4.2777778
3   0  cyl   6.8888889   7.4444444   7.8888889
4   0 disp 262.3205556 307.1500000 352.4481944
5   0 drat   3.1877639   3.3922222   3.6011528
6   0 gear   3.2222222   3.5555556   3.9444444
7   0   hp 164.0500000 189.7222222 218.5625000
8   0  mpg  14.9552778  16.6166667  18.3225000
9   0 qsec  16.1888750  16.6938889  17.1744583
10  0   vs   0.0000000   0.0000000   0.0000000
11  0   wt   3.2929569   3.6885556   4.0880069
12  1   am   0.2142857   0.5000000   0.7857143
13  1 carb   1.2857143   1.7857143   2.3571429
14  1  cyl   4.1428571   4.5714286   5.0000000
15  1 disp 105.5703571 132.4571429 161.4657143
16  1 drat   3.5992143   3.8592857   4.1100000
17  1 gear   3.5714286   3.8571429   4.1428571
18  1   hp  79.7125000  91.3571429 103.2142857
19  1  mpg  21.8498214  24.5571429  27.3289286
20  1 qsec  18.7263036  19.3335714  20.0665893
21  1   vs   1.0000000   1.0000000   1.0000000
22  1   wt   2.2367000   2.6112857   2.9745571

Other updates to answer questions in the comments

UPDATE: To answer your comment to me in @BenBolker's answer: If you want the results returned by sample, you can do this:

boot.dat = replicate(1000, sample(mtcars$mpg[mtcars$vs==1], replace=TRUE))

This will return a matrix with 1000 columns, each of which will be a separate bootstrap sample of mtcars$mpg for vs==1. You could also do:

boot.by.vs = sapply(split(mtcars, mtcars$vs), function(df) {
   replicate(1000, sample(df$mpg, replace=TRUE))
}, simplify=FALSE)

This will return a list where the first list element is the matrix of bootstrap samples for vs==0 and the second is for vs==1.

UPDATE 2: To answer your second comment, here's how to bootstrap the whole data frame (and assuming you want to save all the copies, rather than summarise them. The code below returns a list of 1000 bootstrapped versions of mtcars1. This list will be huge if you have a lot of data, so you'll probably just want to keep summary results, like column means, for each bootstrap sample.

boot.df = lapply(1:1000, function(i) mtcars[sample(1:nrow(mtcars), replace=TRUE), ])

2

votes

Using your code from above,

data.frame(boot=1:1000) %>%
  group_by(boot) %>% 
  do(sample_n(mtcars, nrow(mtcars), replace=TRUE)) %>%
  group_by(boot, vs) %>%
dplyr::summarise(mean.mpg = mean(mpg, na.rm = TRUE),
                 sd.mpg = sd(mpg, na.rm = TRUE),
                 n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.1 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.1 / 2), n.mpg - 1) * se.mpg) %>% 
    group_by(vs) %>% summarise_each(funs(mean), vars = -boot)

The answer is

# A tibble: 2 x 7
     vs mean.mpg   sd.mpg n.mpg   se.mpg lower.ci.mpg upper.ci.mpg
  <dbl>    <dbl>    <dbl> <dbl>    <dbl>        <dbl>        <dbl>
1     0 16.62142 3.679562 17.97 0.876537     15.09220     18.15063
2     1 24.53193 5.125643 14.03 1.388702     22.05722     27.00663

bootstrapped confidence intervals with dplyr

3 Answers

ORIGINAL ANSWER: Bootstrapping a single column

UPDATE: Bootstrapping any selection of columns

Other updates to answer questions in the comments