0
votes

I am currently trying to run a bootstrap analysis on some data where the end result is to get bootstrap confidence intervals around proportions of count data.

For example, my current data that I am trying to bootstrap will take this form (character):


> foo
   notes
1      a
2      b
3      c
4      c
5      b
6      c
7      b
8      c
9      a
10     a
11     c
12     b
13     d
14     e
15     f
16     f
17     g
18     a
19     b
20     c
21     c

Which you can get here with dput()

structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c", 
"a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))

In trying to set up a function that will output a named vector similar to what is needed for the boot package to run properly ( see example here), I have composed the following function that uses dplyr code:

library(dplyr)

notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()

  # take random rows 
  sampler <- d#[i,]
  
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>% 
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  
  output <- setNames(final_table$proportion, final_table$notes)
  
  return(output)
  
}

And when this version of the function is run with boot(), it runs just fine with the critical problem of it just sampling the entire dataset (not doing a bootstrap because of the commented out portion of the code). If you run this, you'll see every estimate is the same.

bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)

bootstrap_analysis$t

If I do run the function with the portion that randomly subsets the variables for the bootstrap analysis, as in the code below (same as above but comment removed):


notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()

  # take random rows 
  sampler <- d[i,]
  
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>% 
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  
  output <- setNames(final_table$proportion, final_table$notes)
  
  return(output)
  
}

Then I get the following error:

> bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)
 Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "character" 

A solution to the problem would be for this code to run so the bootstrap analysis works as written (possibly a tidy evaluation problem?) or for someone to suggest a more efficient way doing this bootstrap analysis in general.

1
sampler <- d[i,, drop = FALSE]. Extraction defaults to simplifying to the least possible dimensions and since d is just one column, the result of d[i,] is a character vector, not a df. Also, when bootstrapping, set the RNG seed in order to make the results reproducible, set.seed(<integer>).Rui Barradas

1 Answers

1
votes

Interesting issue! I have tried to solve this problem without using the boot-package but rather by using the base-functions (mainly because of transparency purposes).

I could figure out this:

#Assigning the provided structure to an object called "df"

df <- structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c", 
                               "a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", 
                               "c", "c")),
               class = "data.frame", row.names = c(NA, -21L))

#Specifying the bootstrap replications (as far as I know, it's, however, 
#rather recommended to use 10K replications and more)

B <- 100

#Number of observations (i.e., 21 in this case)

N <- nrow(df)

#setting the seed to ensure pseudo-randomness for the samples
#we just want to generate and, of course, to ensure general reproducibility 

set.seed(42, sample.kind = "Rounding")

#bootstrapping the proportion (aka mean) of the note "a" 

boot_note_a <- replicate(B, {

#taking random samples of the same sample size of those notes 
#and putting back the taken sample in the urn—for each iteration

notes_star <- sample(df$notes, N, replace = T)

#getting the proportion of the note "a" within each bootstrapped sample.
#Hence, we'll get B (100 in this case) times a proportion of the note "a"
#based on the respective bootstrapped sample.  

mean(notes_star == "a")

}) 

#getting the confidence interval (at a confidence level of 95%) of the 
#bootstrapped proportion of the note "a" in the bootstrapped sample

quantile(boot_note_a, prob = c(0.025, 0.975)) 

Finally, we can quickly plot (pun intended) this result like this:

#calculating the binwidth according to Freedman & Diaconis (1981); see also 
#Hyndman (1995)

binw <- 2 * IQR(boot_note_a) / length(boot_note_a)^(1/3) 

#plotting 
p1 <- qplot(boot_note_a, binwidth = binw, color = I("red") )
p2 <- qplot(sample = scale(boot_note_a), xlab = "theoretical", ylab = "sample")+ 
            geom_abline()
gridExtra::grid.arrange(p1, p2, ncol = 2)

At the bottom line, I think you'll get the desired result—at least for the note "a" (admittedly, we would have to repeat this strategy for the remaining six notes). Thus, this solution might not be the most efficient way but, hopefully, transparent after all. If this solving-strategy is working well enough, we can adapt this and make it more efficient by using the apply-family or so.

Cheers, Kew!