I am trying to run a bootstrap analysis whose end result is a set of bootstrap confidence intervals around proportions of count data.
For example, the data I am trying to bootstrap takes this form (a single character column):
> foo
notes
1 a
2 b
3 c
4 c
5 b
6 c
7 b
8 c
9 a
10 a
11 c
12 b
13 d
14 e
15 f
16 f
17 g
18 a
19 b
20 c
21 c
You can recreate it from this dput() output:
structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c",
"a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
To get a function that outputs a named vector similar to what the boot package needs to run properly (see example here), I wrote the following function, which uses dplyr:
library(dplyr)
notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()
  # take random rows
  sampler <- d#[i,]
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>%
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  output <- setNames(final_table$proportion, final_table$notes)
  return(output)
}
When this version of the function is run with boot(), it runs without error, but with the critical problem that it uses the entire dataset on every replicate (no resampling happens, because the indexing is commented out). If you run this, you'll see every estimate is the same.
library(boot)
bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)
bootstrap_analysis$t
If I instead run the function with the part that randomly subsets the rows for the bootstrap, as in the code below (same as above but with the comment marker removed):
notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()
  # take random rows
  sampler <- d[i,]
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>%
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  output <- setNames(final_table$proportion, final_table$notes)
  return(output)
}
Then I get the following error:
> bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "character"
A solution would be either a way to make this code run so the bootstrap analysis works as written (possibly a tidy evaluation problem?), or a suggestion for a more efficient way of doing this bootstrap analysis in general.
sampler <- d[i,, drop = FALSE]. Extraction defaults to simplifying to the lowest possible dimension, and since d is just one column, the result of d[i,] is a character vector, not a df. Also, when bootstrapping, set the RNG seed to make the results reproducible: set.seed(<integer>). – Rui Barradas
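To make the comment above concrete, here is a minimal sketch. It first shows the drop behaviour on the data from the question, then runs the bootstrap with a statistic written in base R (table() on a factor with fixed levels) instead of the dplyr pipeline from the question; that swap is my own assumption about what a "more efficient way" could look like, not something the asker or commenter wrote. The seed value, R = 1000, the helper variable idx and the choice of the "c" category for the interval are all arbitrary illustrations.

```r
library(boot)

foo <- data.frame(notes = c("a", "b", "c", "c", "b", "c", "b", "c",
                            "a", "a", "c", "b", "d", "e", "f", "f",
                            "g", "a", "b", "c", "c"))

# With a one-column data frame, d[i, ] is simplified to a plain vector;
# drop = FALSE keeps it a data frame.
idx <- c(1, 1, 2)                      # hypothetical resampled row indices
class(foo[idx, ])                      # "character"
class(foo[idx, , drop = FALSE])        # "data.frame"

# Statistic for boot(): proportion of each category in the resample.
# Fixing the factor levels to the categories of the full data means
# categories missing from a resample get a proportion of 0, not NA.
notes_bootstrap <- function(d, i) {
  levels_all <- sort(unique(d$notes))
  resampled  <- factor(d$notes[i], levels = levels_all)
  counts     <- table(resampled)
  setNames(as.vector(counts) / sum(counts), levels_all)
}

set.seed(123)                          # arbitrary seed, for reproducibility
bootstrap_analysis <- boot(foo, notes_bootstrap, R = 1000)
head(bootstrap_analysis$t)             # estimates now differ between rows

# Percentile interval for one category, selected by its position in t0
c_index <- which(names(bootstrap_analysis$t0) == "c")
ci <- boot.ci(bootstrap_analysis, type = "perc", index = c_index)
ci
```

If you would rather keep the dplyr version, changing only the subsetting line to d[i, , drop = FALSE] as suggested in the comment should be enough, since count() then receives a data frame again.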