Sample n random rows per group in a dataframe

29

votes

From these questions (Random sample of rows from subset of an R dataframe, and Sample random rows in dataframe) I can easily see how to randomly sample (select) 'n' rows from a df, or 'n' rows that originate from a specific level of a factor within a df.

Here are some sample data:

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.

For example, to sample just 3 random rows from the 'pink' color using library(kimisc):

library(kimisc)
sample.rows(subset(df, color == "pink"), 3)

Or by writing a custom function:

sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)

However, I want to sample 3 (or n) random rows from each level of the factor. I.e. the new df would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create newdfs for each color, and then bind them together, but I am looking for a simpler solution.

r random dataframe sample

7

votes

You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that is something you care about. You can also reuse the rndid vector to create subsets of different sizes fairly easily.
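
As a rough illustration of reusing the random IDs, the sketch below rebuilds the question's sample data (the seed and the names three_per_group and five_per_group are assumptions added here) and draws two samples of different sizes from the same rndid vector. Because both samples come from the same ranking, the smaller one is nested inside the larger one.

```r
set.seed(123)  # assumed seed, only so the example is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Random rank 1..10 within each color
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

# Reuse the same ranks for different sample sizes
three_per_group <- df[rndid <= 3, ]
five_per_group  <- df[rndid <= 5, ]

table(three_per_group$color)
table(five_per_group$color)
```

If independent samples are needed instead, rndid would have to be regenerated for each draw.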

34

votes

In dplyr 0.3 and later, this works just fine:

df %>% group_by(color) %>% sample_n(size = 3)
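
For readers on a recent dplyr (1.0.0 or later), sample_n() is superseded by slice_sample(). The following is a minimal sketch on the question's sample data, assuming dplyr is installed; the seed and the name res are additions made for this example.

```r
library(dplyr)

set.seed(123)  # assumed seed, only so the example is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Draw 3 rows from each color group, without replacement
res <- df %>%
  group_by(color) %>%
  slice_sample(n = 3) %>%
  ungroup()

table(res$color)
```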

Old versions of dplyr (version <= 0.2)

I set out to answer this using dplyr, assuming that this would work:

df %.% group_by(color) %.% sample_n(size = 3)

But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to do this:

df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color

            X1         X2  color
8   0.66152710 -0.7767473   blue
1  -0.70293752 -0.2372700   blue
2  -0.46691793 -0.4382669   blue
32 -0.47547565 -1.0179842   pink
31 -0.15254540 -0.6149726   pink
39  0.08135292 -0.2141423   pink
15  0.47721644 -1.5033192    red
16  1.26160230  1.1202527    red
12 -2.18431919  0.2370912    red
24  0.10493757  1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow

Presumably this will be fixed in a future update.

7

votes

I would consider my stratified function, which is presently hosted as a GitHub Gist.

Get it with:

library(devtools)  ## To download "stratified"
source_gist("https://gist.github.com/mrdwab/6424112")

And use it with:

stratified(df, "color", 3)

The function has several features that are convenient for stratified sampling. For instance, you can restrict the sample to certain groups "on the fly":

stratified(df, "color", 3, select = list(color = c("blue", "red")))

To give you a sense of what the function does, here are the arguments to stratified:

df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
- If size is a value less than 1, a proportionate sample is taken from each stratum.
- If size is a single integer of 1 or more, that number of samples is taken from each stratum.
- If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
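
To make the named-vector form of size more concrete, here is a small base R sketch of the idea. It is not the gist's implementation: sample_per_stratum is a hypothetical helper written for this example, and it only handles a single grouping column and a named vector of per-stratum counts.

```r
# Hypothetical helper (not the gist's code): sample a different
# number of rows from each named stratum.
sample_per_stratum <- function(data, group, size) {
  pieces <- split(data, data[[group]])
  picked <- lapply(names(size), function(g) {
    piece <- pieces[[g]]
    piece[sample.int(nrow(piece), size[[g]]), , drop = FALSE]
  })
  do.call(rbind, picked)
}

set.seed(1)  # assumed seed, only so the example is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# 2 rows from "blue", 5 from "pink"; strata not named in size are skipped
res <- sample_per_stratum(df, "color", c(blue = 2, pink = 5))
table(res$color)
```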

6

votes

Here's a solution. We split a data.frame into color groups. Then we sample 3 rows from each group. This yields a list of data.frames.

df2 <- lapply(split(df, df$color),
  # sample(nrow(subdf), 3) picks 3 row indices; drop = FALSE keeps a data.frame
  function(subdf) subdf[sample(nrow(subdf), 3), , drop = FALSE]
)

To obtain the desired result, we bind the list of data.frames into one data.frame:

do.call('rbind', df2)
##                    X1          X2  color
## blue.3    -1.22677188  1.25648082   blue
## blue.4    -0.54516686 -1.94342967   blue
## blue.1     0.44647071  0.16283326   blue
## pink.40    0.23520296 -0.40411906   pink
## pink.34    0.02033939 -0.32321309   pink
## pink.33   -1.01790533 -1.22618575   pink
## red.16     1.86545895  1.11691250    red
## red.11     1.35748078 -0.36044728    red
## red.13    -0.02425645  0.85335279    red
## yellow.21  1.96728782 -1.81388110 yellow
## yellow.25 -0.48084967  0.07865186 yellow
## yellow.24 -0.07056236 -0.28514125 yellow
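
The group prefix in the row names above comes from the names of the list returned by split(). If the original row names are preferred, one option is to drop the list names before binding. This is a sketch on the question's sample data; the seed and the name res are assumptions added here.

```r
set.seed(123)  # assumed seed, only so the example is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Sample 3 rows from each color group
df2 <- lapply(split(df, df$color),
  function(subdf) subdf[sample(nrow(subdf), 3), , drop = FALSE]
)

# unname() removes the list names, so rbind keeps the original row names
res <- do.call(rbind, unname(df2))
rownames(res)
```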

0

votes

Here is a way, in base R, that allows for multiple groups and sampling with replacement:

n <- 3
resample <- TRUE
index <- 1:nrow(df)
fun <- function(x) sample(x, n, replace = resample)
a <- aggregate(index, by = list(group = df$color), FUN = fun )

df[c(a$x),]

To add another group, include it in the 'by' argument to aggregate.
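
As a sketch of what adding a second group might look like, the example below invents a size column (a hypothetical second factor, not part of the question's data) and samples 2 rows without replacement from each color/size combination.

```r
set.seed(42)  # assumed seed, only so the example is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)
df$size <- rep(c("small", "large"), times = 20)  # hypothetical second factor

n <- 2
index <- seq_len(nrow(df))

# One row of 'a' per color/size combination; a$x holds the sampled row indices
a <- aggregate(index, by = list(color = df$color, size = df$size),
               FUN = function(x) sample(x, n))

res <- df[c(a$x), ]
table(res$color, res$size)
```

Note that c(a$x) flattens the index matrix column by column, so the rows of res are interleaved across groups rather than sorted by group.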
