Maybe this is overkill for your question, but I've written a function called stratified that should work for you.
The features include:
- Allowing the user to specify multiple grouping variables.
- Sampling a fixed number of rows, or a proportion, from each group.
- Allowing the user to specify which subsets from the grouping variables should be considered when sampling.
Here's an example:
## (Or just copy and paste the function in your session)
library(devtools)
source_gist("https://gist.github.com/mrdwab/6424112")
stratified(mtcars, "gear", 3)
#                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
# Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
# Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
# Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
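For readers who want to see the idea behind the call above, here is a minimal base-R sketch of fixed-size sampling per group. It is not the code from the gist: sample_per_group is a hypothetical name, and the sketch assumes that a group smaller than the requested size is simply returned whole.

```r
# Hypothetical helper for illustration; not the gist's implementation.
sample_per_group <- function(df, group, size) {
  # Split the row indices by the combination of the grouping columns.
  idx <- split(seq_len(nrow(df)), df[group], drop = TRUE)
  picked <- lapply(idx, function(i) {
    # Cap at the group size; indexing via sample.int avoids the
    # sample(x, n) surprise when a group has a single row.
    i[sample.int(length(i), min(size, length(i)))]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(42)  # only to make the draw repeatable
s <- sample_per_group(mtcars, "gear", 3)
table(s$gear)  # three rows for each of gear 3, 4 and 5
```

The rows drawn will differ from the output shown above, since the sampling is random.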
Here are some more examples:
If we use "carb" and "cyl" as the grouping variables, note that some of the combinations have fewer than 3 rows of data:
table(interaction(mtcars[c("carb", "cyl")], drop = TRUE))
#
# 1.4 2.4 1.6 4.6 6.6 2.8 3.8 4.8 8.8
#   5   6   2   4   1   4   3   6   1
Here is how stratified handles that case, along with the warning it generates:
out1 <- stratified(mtcars, c("carb", "cyl"), 3)
# Some groups
# ---1.6, 6.6, 8.8---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
Tabulating out1 shows that every group is capped at 3 rows and that the three small groups are returned whole; head() shows the first few rows of the result:
table(interaction(out1[c("carb", "cyl")], drop = TRUE))
#
# 1.4 2.4 1.6 4.6 6.6 2.8 3.8 4.8 8.8
#   3   3   2   3   1   3   3   3   1
head(out1)
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
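The groups named in the warning can also be found by hand, by comparing each group's row count with the requested size. This is only a sketch of that check in base R, not how the gist produces its warning; size and too_small are illustrative names.

```r
size <- 3
# Count rows per carb/cyl combination, dropping empty combinations.
counts <- table(interaction(mtcars[c("carb", "cyl")], drop = TRUE))
# Names of the groups that cannot supply `size` rows.
too_small <- names(counts)[as.vector(counts) < size]
too_small  # the same three groups listed in the warning: 1.6, 6.6, 8.8
```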
You can also "subset" while sampling. For example, if you only want "carb" values of 1, 2, and 4, and "cyl" values of 4 and 8 to be included, you can do:
out2 <- stratified(mtcars, c("carb", "cyl"), 3,
                   select = list(carb = c(1, 2, 4),
                                 cyl = c(4, 8)))
out2
#                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
# Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
# Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
# Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
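One way to picture select is as a filter applied to the data before any sampling happens. The following base-R sketch works under that assumption and is not taken from the gist; keep and sub are illustrative names.

```r
# Stand-in for the `select` argument: a named list of allowed values.
select <- list(carb = c(1, 2, 4), cyl = c(4, 8))
# Keep a row only if every listed column holds one of its allowed values.
keep <- Reduce(`&`, Map(function(col, vals) mtcars[[col]] %in% vals,
                        names(select), select))
sub <- mtcars[keep, ]
# Only the carb/cyl combinations that survive the filter remain.
table(interaction(sub[c("carb", "cyl")], drop = TRUE))
```

Sampling 3 rows from each of the four remaining groups gives the 12 rows shown in out2.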
The size argument also accepts a value less than 1 if you want to sample a proportion of each group. For example, setting size = .25 would sample 25% (rounded) of each group.
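To get a feel for what a proportional size means in row counts, the per-group numbers can be worked out directly. This sketch assumes the rounding is done with round(); the gist may round differently, and per_group is an illustrative name.

```r
# Assumption for illustration: a size below 1 is a proportion of each group,
# rounded with round().
size <- 0.25
group_n <- table(mtcars$gear)        # 15, 12 and 5 rows for gear 3, 4 and 5
per_group <- round(size * group_n)   # rows to draw from each group
per_group
```

With mtcars grouped by "gear", this works out to 4, 3 and 1 rows respectively.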