I have some data where the summary of the number of observations looks like:
# A tibble: 14 x 3
# Groups:   status [2]
   status  year     n
    <dbl> <dbl> <int>
 1      0  2010  4593
 2      0  2011 10990
 3      0  2012 27711
 4      0  2013 99989
 5      0  2014 95407
 6      0  2015 89010
 7      0  2016 72289
 8      1  2010   584
 9      1  2011   785
10      1  2012   640
11      1  2013   667
12      1  2014   377
13      1  2015   460
14      1  2016   104
One class (status 0) has significantly more observations than the other (status 1). How can I randomly sample class 0 without doing anything to class 1? That is, I would like to keep all class 1 observations and randomly sample 4593 class 0 observations per year (4593 is the smallest yearly count for class 0, from 2010).
Using group_by(status, year) and then sample_n() doesn't work, since the 4593 value is greater than the number of observations in the class 1 groups.
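For illustration, here is a minimal sketch of the behaviour I am after. It is not a definitive solution. It assumes dplyr is loaded, and `toy`, `target`, `downsampled` and `balanced` are hypothetical names, with a small made-up data set and `target = 10` standing in for my real data and 4593. The idea is to sample class 0 on its own, per year, capping the sample size at the group size, and then bind the untouched class 1 rows back on.

```r
library(dplyr)

set.seed(42)  # only so the sketch is reproducible

# Hypothetical toy data: class 0 is much larger than class 1
toy <- tibble(
  status = c(rep(0, 60), rep(1, 9)),
  year   = c(rep(2010:2012, times = c(10, 20, 30)),
             rep(2010:2012, each = 3))
)

target <- 10  # stands in for 4593 in the real data

# Sample class 0 within each year, never asking for more rows than exist
downsampled <- toy %>%
  ungroup() %>%                      # drop any existing grouping first
  filter(status == 0) %>%
  group_by(year) %>%
  slice(sample.int(n(), min(n(), target))) %>%
  ungroup()

# Keep every class 1 row unchanged and combine the two pieces
balanced <- bind_rows(downsampled, filter(ungroup(toy), status == 1))

count(balanced, status, year)
```

Because class 1 never passes through the sampling step, its smaller group sizes cannot trigger the size error from sample_n().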
Some random sample of my data:
structure(list(status = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
year = c(2013, 2014, 2012, 2013, 2016, 2013, 2015, 2014,
2013, 2016, 2015, 2016, 2011, 2014, 2016, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2012, 2016, 2016, 2012, 2016, 2015,
2013, 2014, 2015, 2013, 2015, 2015, 2014, 2015, 2011, 2014,
2013, 2012, 2011, 2016, 2015, 2015, 2015, 2014, 2012, 2013,
2015, 2012, 2015, 2016, 2015, 2013, 2014, 2014, 2014, 2013,
2013, 2016, 2016, 2013, 2015, 2012, 2014, 2014, 2013, 2015,
2014, 2016, 2016, 2014, 2012, 2016, 2013, 2010, 2011, 2014,
2016, 2013, 2016, 2014, 2014, 2013, 2013, 2013, 2016, 2016,
2012, 2014, 2013, 2015, 2016, 2013, 2013, 2015, 2013, 2014,
2013, 2015, 2013, 2013, 2011, 2014, 2016, 2013, 2010, 2012,
2014, 2012, 2011, 2011, 2013, 2015, 2014, 2010, 2010, 2013,
2010, 2014, 2011, 2011, 2014, 2013, 2014, 2015, 2015, 2013,
2014, 2013, 2011, 2013, 2014, 2013, 2011, 2013, 2012, 2015,
2012, 2012, 2012, 2010, 2013, 2013, 2011, 2011, 2011, 2012,
2016, 2013, 2011, 2011, 2012, 2012, 2014, 2010, 2013, 2014,
2011, 2012, 2010, 2012, 2012, 2011, 2015, 2011, 2011, 2013,
2015, 2010, 2015, 2011, 2015, 2015, 2012, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2014, 2010, 2011, 2013, 2014, 2012,
2013, 2016, 2014, 2012, 2012, 2013, 2010, 2012, 2013, 2014,
2014, 2011)), groups = structure(list(status = c(0, 1), .rows = structure(list(
1:100, 101:200), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), row.names = c(NA, -200L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
sample_n() where size = 4593? – EJJ
status variable which has 2 classes, 0 and 1. I would like to group_by or filter the class 0 variable and then take a random sample of these for each year. Using something like sample_n() with size = 4593 is what I am looking for, yes, but this does not work when doing data %>% group_by(status, year) %>% sample_n(size = 4593) since it returns Error: size must be less or equal than 584 (size of data), set replace = TRUE to use sampling with replacement. Setting replace = TRUE doesn't give me the correct output either. – user8959427