3
votes

This question is not about sampling data (I know about sample_n). It is about simulating data from a data frame so that I can compare the simulated means with the actual ones (computed using group_by and summarise).

I calculated the actual mean for each group using the code below:

df %>% 
  group_by(allfour) %>% 
  summarise(hs_completion=mean(hsgrad),
            count=n())

However, I am struggling to draw 100 simulations from each group, divide each vector by the respective group size to turn it into simulated graduation rates, and calculate the difference in these rates between the two groups. After this, I need to draw a histogram of the simulated differences and add a red vertical line at the difference in means calculated from the observed data.

I know tidyverse and ggplot, so plotting is not an issue. My question is just how to run 100 simulations when the records are limited.

A sample of the data frame df is below:

    structure(list(hsgrad = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 
1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L), allfour = structure(c(1L, 
2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 1L), .Label = c("0", "1"), class = "factor")), row.names = c(NA, 
100L), class = "data.frame")
1
What do you mean by "draw 100 simulations"? Is it sub-sampling, i.e. choosing 20 from each group, or is it sampling with replacement? - StupidWolf
Sample with replacement - Vaibhav Singh
By 100 simulations, I meant the size of each sample @StupidWolf - Vaibhav Singh
So 1 draw with replacement? Once you do group_by(..) you just sample_n(), right? - StupidWolf
No, I am aware of sample_n(), but the goal is to create simulations from this data - Vaibhav Singh

1 Answer

2
votes

The important information is in this line:

[image not preserved]

So you need to simulate Bernoulli trials with the same probability of success in each group. First we calculate the overall success (graduation) rate:

rate = mean(df$hsgrad)

The basic code for one group is below. You give rbinom the number of simulations (1000), the number of trials (i.e. the size of the group), and the rate of success (from above):

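# 1000 simulated success counts for a group of this size, at the pooled rate
sim_1 = rbinom(1000,sum(df$allfour==1),prob=rate)
# divide by the group size to turn counts into rates
hist(sim_1/sum(df$allfour==1),breaks=20)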


This gives you the simulated graduation rate in the allfour==1 group, under the assumption that its true rate is the overall rate. Now we just need to do this for both groups:

library(dplyr)
library(ggplot2)

grp0_size = sum(df$allfour==0)
grp1_size = sum(df$allfour==1)
nsim = 1000
# observed difference in graduation rates (group 1 minus group 0)
observed = diff(tapply(df$hsgrad,df$allfour,mean))

data.frame(
  grp0_success = rbinom(nsim,grp0_size,rate)/grp0_size,
  grp1_success = rbinom(nsim,grp1_size,rate)/grp1_size) %>%
  mutate(diff=grp1_success-grp0_success) %>%
  ggplot(aes(x=diff)) + geom_histogram(bins=30) +
  # red vertical line at the observed difference
  geom_vline(xintercept=observed,color="red")

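Since the comments describe this as sampling with replacement, it may help to see that the two views agree: drawing n outcomes with replacement from the pooled 0/1 values and taking their mean has the same distribution as rbinom(1, n, rate)/n. Below is a minimal sketch of that check. The data frame toy and the helper sim_rate are made up for illustration (they are not the asker's data or part of any package), and the group sizes are arbitrary.

```r
# Toy data standing in for df (hypothetical values, not the asker's data)
set.seed(42)
toy <- data.frame(
  hsgrad  = rbinom(200, 1, 0.6),
  allfour = factor(rep(c("0", "1"), times = c(160, 40)))
)

rate <- mean(toy$hsgrad)           # pooled graduation rate
n0   <- sum(toy$allfour == "0")    # size of group 0
n1   <- sum(toy$allfour == "1")    # size of group 1
nsim <- 1000

# One simulated rate: resample n pooled outcomes with replacement, take the mean
sim_rate <- function(n) mean(sample(toy$hsgrad, n, replace = TRUE))

# Resampling version of the simulated difference in rates
diff_resample <- replicate(nsim, sim_rate(n1) - sim_rate(n0))

# rbinom version, following the same recipe as the answer
diff_binom <- rbinom(nsim, n1, rate) / n1 - rbinom(nsim, n0, rate) / n0

# Both sets of differences should be centred near 0 with a similar spread
c(mean(diff_resample), mean(diff_binom))
c(sd(diff_resample), sd(diff_binom))
```

Either vector can be passed to the histogram code in place of the diff column. The rbinom version is simply the vectorised shortcut, so it avoids the replicate loop.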