Below are four approaches: base R's mapply, stratified from the splitstackshape package, strata from the sampling package, and map2 from the purrr package (which is part of the tidyverse collection of packages).
First let's set up some fake data and sampling parameters:
# Fake data
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))
# Add a grouping variable for age range
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
# Total number of people sampled
n = 1000
# Named vector of sample proportions by group
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))
With these parameters, we want to draw n rows in total, with each age group contributing the share given in probs (that is, probs*n rows per group).
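As a quick sketch of what probs*n works out to, the per-group counts can be computed and checked before sampling. The name sizes is introduced here for illustration only, and the group labels are typed out by hand so the block runs on its own; round() is an added safeguard in case a proportion times n is not a whole number.

```r
# Same parameters as above, redefined so this block is self-contained
n = 1000
probs = c("[0,30)"=0.3, "[30,51)"=0.3, "[51,70)"=0.2, "[70,Inf)"=0.2)

# Number of rows to draw from each stratum (hypothetical helper variable)
sizes = round(probs * n)
sizes

# Sanity checks: the proportions sum to 1 and the counts add up to n
stopifnot(isTRUE(all.equal(sum(probs), 1)), sum(sizes) == n)
```

If the proportions do not divide n evenly, rounding can leave the total one or two rows off n, which is why the second check is worth keeping.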
Option 1: mapply
mapply applies a function to several arguments in parallel. Here, the arguments are (1) the data frame df split into the four age groupings, and (2) probs*n, which gives the number of rows we want from each age group:
df.sample = mapply(a=split(df, df$age.groups), b=probs*n,
                   function(a,b) {
                     a[sample(1:nrow(a), b), ]
                   }, SIMPLIFY=FALSE)
mapply returns a list of four data frames, one for each stratum. Combine this list into a single data frame:
df.sample = do.call(rbind, df.sample)
Check the sampling:
table(df.sample$age.groups)
  [0,30)  [30,51)  [51,70) [70,Inf) 
     300      300      200      200 
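The three steps above (split, sample each piece, recombine) could be wrapped in one function. This is only a sketch: sample_strata is a hypothetical name, not a function from any package, and the toy data frame is made up for the example. It differs from the code above in one way: it matches the counts to the strata by name, so the order of the size vector does not matter.

```r
# Hypothetical helper: name and signature are illustrative only
sample_strata = function(data, group, sizes) {
  pieces = split(data, data[[group]])
  # Reorder the counts to match the strata by name, not by position
  sizes = sizes[names(pieces)]
  out = mapply(function(piece, k) piece[sample.int(nrow(piece), k), , drop=FALSE],
               pieces, sizes, SIMPLIFY=FALSE)
  do.call(rbind, out)
}

# Toy data: 50 rows in each of two groups
set.seed(1)
toy = data.frame(id=1:100, g=rep(c("a","b"), each=50))

# Counts deliberately given in a different order than the strata
res = sample_strata(toy, "g", c(b=5, a=10))
table(res$g)
```

Matching by name is a design choice: mapply itself pairs its arguments by position, which works in the example above only because probs was built from levels(df$age.groups).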
Option 2: stratified function from the splitstackshape package
Here the size argument takes a named vector giving the number of rows to sample from each stratum.
library(splitstackshape)
df.sample2 = stratified(df, "age.groups", size=probs*n)
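To see the named-vector form of size on something small, here is a minimal sketch on made-up data (the toy data frame and the counts are assumptions for illustration). It assumes the splitstackshape package is installed.

```r
library(splitstackshape)

# Toy data: 50 rows in each of two groups
set.seed(1)
toy = data.frame(id=1:100, g=rep(c("a","b"), each=50))

# Names of the size vector identify the strata; values are rows to draw
res2 = stratified(toy, "g", size=c(a=10, b=5))

# Check how many rows came from each stratum
table(res2$g)
```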
Option 3: strata function from the sampling package
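This option is by far the slowest of the four: a median of about 5.7 seconds, versus roughly 0.05 to 0.14 seconds for the others, in the timings below.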
library(sampling)
# Data frame must be sorted by stratification column(s)
df = df[order(df$age.groups),]
sampled.rows = strata(df, 'age.groups', size=probs*n, method="srswor")
df.sample3 = df[sampled.rows$ID_unit, ]
Option 4: tidyverse packages
map2 is like mapply in that it applies two arguments in parallel to a function, in this case the dplyr package's sample_n function. map2 returns a list of four data frames, one for each stratum, which we combine into a single data frame with bind_rows.
library(dplyr)
library(purrr)
df.sample4 = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows
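Since dplyr 1.0.0, sample_n is superseded by slice_sample. A sketch of the same map2 pattern with slice_sample, on made-up toy data (the data frame and the sizes vector are assumptions for illustration):

```r
library(dplyr)
library(purrr)

# Toy data: 50 rows in each of two groups
set.seed(1)
toy = data.frame(id=1:100, g=rep(c("a","b"), each=50))

# Rows to draw per stratum, in the same order that split() produces
sizes = c(a=10, b=5)

# .x is one stratum's data frame, .y is the matching count
res4 = map2(split(toy, toy$g), sizes, ~ slice_sample(.x, n = .y)) %>% bind_rows()
table(res4$g)
```

Like mapply, map2 pairs its two inputs by position, so the size vector has to be in the same order as the split groups.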
Timings
library(microbenchmark)
Unit: milliseconds
       expr        min         lq       mean     median         uq       max neval cld
     mapply   86.77215  110.82979  156.66855  123.95275  145.25115  486.2078    10  a 
     strata 5028.42933 5541.40442 5709.16796 5699.50711 5845.69921 6467.7250    10   b
 stratified   38.33495   41.76831   89.93954   45.43525   79.18461  408.2346    10  a 
  tidyverse   71.48638  135.49113  143.12011  142.86866  155.72665  192.4174    10  a 
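The call that produced these timings is not shown. As a rough sketch of how such a comparison might be set up, the block below times two of the four options with microbenchmark. Everything in it is an assumption: it uses a smaller data set (1e5 rows) so that it runs quickly, and it leaves out the strata and stratified expressions, which would be added as further named arguments in the same way. Its numbers will not match the table above.

```r
library(microbenchmark)
library(dplyr)
library(purrr)

# Smaller fake data than above (1e5 rows) so the sketch runs quickly
set.seed(156)
df = data.frame(age=sample(0:100, 1e5, replace=TRUE))
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
n = 1000
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))

# Each named argument is one expression to time; times=10 matches neval above
timings = microbenchmark(
  mapply = do.call(rbind, mapply(function(a, b) a[sample(1:nrow(a), b), ],
                                 split(df, df$age.groups), probs*n,
                                 SIMPLIFY=FALSE)),
  tidyverse = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows(),
  times = 10
)
timings
```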