Calculating length of 95%-CI using dplyr

14

votes

Last time I asked how it was possible to calculate the average score per measurement occasion (week) for a variable (procras) that has been measured repeatedly for multiple respondents. So my (simplified) dataset in long format looks for example like the following (here two students, and 5 time points, no grouping variable):

studentID  week   procras
   1        0     1.4
   1        6     1.2
   1        16    1.6
   1        28    NA
   1        40    3.8
   2        0     1.4
   2        6     1.8
   2        16    2.0
   2        28    2.5
   2        40    2.8

Using dplyr I would get the average score per measurement occasion

mean_data <- group_by(DataRlong, week)%>% summarise(procras = mean(procras, na.rm = TRUE))

Looking like this e.g.:

Source: local data frame [5 x 2]
        occ  procras
      (dbl)    (dbl)
    1     0 1.993141
    2     6 2.124020
    3    16 2.251548
    4    28 2.469658
    5    40 2.617903

With ggplot2 I could now plot the average change over time, and by easily adjusting the group_data() of dplyr I could also get means per sub groups (for instance, the average score per occasion for men and women). Now I would like to add a column to the mean_data table which includes the length for the 95%-CIs around the average score per occasion.

http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/ explains how to get and plot CIs, but this approach seems to become problematic as soon as I wanted to do this for any subgroup, right? So is there a way to let dplyr also include the CI (based on group size, ect.) automatically in the mean_data? After that it should be fairly easy to plot the new values as CIs into the graphs I hope. Thank you.

r ggplot2linechartconfidence-intervaltrend

25

votes

You could do it manually using mutate a few extra functions in summarise

library(dplyr)
mtcars %>%
  group_by(vs) %>%
  summarise(mean.mpg = mean(mpg, na.rm = TRUE),
            sd.mpg = sd(mpg, na.rm = TRUE),
            n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)

#> Source: local data frame [2 x 7]
#> 
#>      vs mean.mpg   sd.mpg n.mpg    se.mpg lower.ci.mpg upper.ci.mpg
#>   (dbl)    (dbl)    (dbl) (int)     (dbl)        (dbl)        (dbl)
#> 1     0 16.61667 3.860699    18 0.9099756     14.69679     18.53655
#> 2     1 24.55714 5.378978    14 1.4375924     21.45141     27.66287

11

votes

I use the ci command from the gmodels package:

library(gmodels)
your_db %>% group_by(gouping_variable1, grouping_variable2, ...)
        %>% summarise(mean = ci(variable_of_interest)[1], 
                      lowCI = ci(variable_of_interest)[2],
                      hiCI = ci(variable_of_interest)[3], 
                      sd = ci (variable_of_interest)[4])

3

votes

If you want to use the versatility of the boot package, I found this blog post useful (the code below is inspired from there)

library(dplyr)
library(tidyr)
library(purrr)
library(boot)

set.seed(321)
mtcars %>%
  group_by(vs) %>%
  nest() %>% 
  mutate(boot_res = map(data,
                        ~ boot(data = .$mpg,
                               statistic = function(x, i) mean(x[i]),
                               R = 1000)),
         boot_res_ci = map(boot_res, boot.ci, type = "perc"),
         mean = map(boot_res_ci, ~ .$t0),
         lower_ci = map(boot_res_ci, ~ .$percent[[4]]),
         upper_ci = map(boot_res_ci, ~ .$percent[[5]]),
         n =  map(data, nrow)) %>% 
  select(-data, -boot_res, -boot_res_ci) %>% 
  unnest(cols = c(n, mean, lower_ci, upper_ci)) %>% 
  ungroup()
#> # A tibble: 2 x 5
#>      vs  mean lower_ci upper_ci     n
#>   <dbl> <dbl>    <dbl>    <dbl> <int>
#> 1     0  16.6     15.0     18.3    18
#> 2     1  24.6     22.1     27.3    14

^{Created on 2020-01-22 by the reprex package (v0.3.0)}

Some explanation of the code:

When nesting with nest(), a list column (called by default data) is created, which contains 2 data frames, being the 2 subsets of the entire mtcars grouped by vs (which contains 2 unique values, 0 and 1). Then, using mutate() and map(), we create the list column boot_res by applying the function boot() from the boot package to the list column data. Then boot_res_ci list column is created by applying the boot.ci() function to boot_res list column and so on. With select(), we drop the list columns that are not needed anymore, fallowed by unnesting and ungrouping the final results.

The code is unfortunately not easy to navigate but it serves the purpose of another example.

Using `broom::tidy()`

Just realized that the package broom has an implementation of a method to deal with boot() outputs as pointed out here. This makes the code a bit less verbose and the output even a bit more complete, including the bias and the standard error of the statistic (the mean here):

library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(boot)

set.seed(321)
mtcars %>%
  group_by(vs) %>%
  nest() %>% 
  mutate(boot_res = map(data,
                        ~ boot(data = .$mpg,
                               statistic = function(x, i) mean(x[i]),
                               R = 1000)),
         boot_tidy = map(boot_res, tidy, conf.int = TRUE, conf.method = "perc"),
         n = map(data, nrow)) %>% 
  select(-data, -boot_res) %>% 
  unnest(cols = -vs) %>% 
  ungroup()
#> # A tibble: 2 x 7
#>      vs statistic    bias std.error conf.low conf.high     n
#>   <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <int>
#> 1     0      16.6 -0.0115     0.843     15.0      18.3    18
#> 2     1      24.6 -0.0382     1.36      22.1      27.3    14

^{Created on 2020-01-22 by the reprex package (v0.3.0)}

`data.table` concise syntax

Note, however, that, I got to a more concise syntax by using the data.table package instead of dplyr:

library(data.table)
library(magrittr)
library(boot)
library(broom)

mtcars <- mtcars %>% copy %>% setDT

set.seed(321)
mtcars[, c(n = .N,
           boot(data = mpg,
                statistic = function(x, i) mean(x[i]),
                R = 1000) %>% 
             tidy(conf.int = TRUE, conf.method = "perc")),
       by = vs]
#>    vs  n statistic        bias std.error conf.low conf.high
#> 1:  0 18  16.61667 -0.01149444 0.8425817 15.03917  18.26653
#> 2:  1 14  24.55714 -0.03822857 1.3633112 22.06429  27.32839

^{Created on 2020-01-23 by the reprex package (v0.3.0)}

Multiple variables at once with data.table

library(data.table)
library(magrittr)
library(boot)
library(broom)

mtcars <- mtcars %>% copy %>% setDT

# Specify here the variables for which you want CIs
variables <- c("mpg", "disp") 

# Function to get the CI stats, will be applied to each column of a subset of
# data (.SD)
get_ci <- function(varb, ...){
  boot(data = varb,
       statistic = function(x, i) mean(x[i]),
       R = 1000) %>% 
    tidy(conf.int = TRUE, ...)
}

set.seed(321)
mtcars[, c(n = .N,
           lapply(.SD, get_ci) %>% 
             rbindlist(idcol = "varb")),
       by = vs, .SDcols = variables]
#>    vs  n varb statistic        bias  std.error  conf.low conf.high
#> 1:  0 18  mpg  16.61667 -0.01149444  0.8425817  15.03917  18.26653
#> 2:  0 18 disp 307.15000 -1.49692222 23.1501247 261.18766 353.04416
#> 3:  1 14  mpg  24.55714 -0.03215714  1.3800432  21.86628  27.50551
#> 4:  1 14 disp 132.45714  0.32994286 14.9070552 104.45798 163.57344

^{Created on 2020-01-23 by the reprex package (v0.3.0)}

1

votes

Update tidyr 1.0.0

All the solutions given by @Valentin are viable but I wanted to hint to a new alternative which is more readable for some of you. It replaces all the summarise solutions by a relatively new [tidyr 1.0.0][1] function called unnest_wider. With that, you can simplify the code to the following:

mtcars %>% 
  nest(data = -"vs") %>%
  mutate(ci = map(data, ~ MeanCI(.x$mpg, method = "boot", R = 1000))) %>% 
  unnest_wider(ci)

which gives:

# A tibble: 2 x 5
     vs data                mean lwr.ci upr.ci
  <dbl> <list>             <dbl>  <dbl>  <dbl>
1     0 <tibble [18 × 10]>  16.6   14.7   18.5
2     1 <tibble [14 × 10]>  24.6   22.0   27.1

Calculating confidence intervals without bootstrapping is even simpler with:

mtcars %>% 
  nest(data = -"vs") %>%
  mutate(ci = map(data, ~ MeanCI(.x$mpg))) %>% 
  unnest_wider(ci)

Calculating length of 95%-CI using dplyr

4 Answers

Using broom::tidy()

data.table concise syntax

Multiple variables at once with data.table

Update tidyr 1.0.0

Using `broom::tidy()`

`data.table` concise syntax