
I'm trying to draw a random sample within each group of a variable, using a different sampling proportion for each group.

For example, I want to sample the iris dataset, keeping 75% of the setosa species, 80% of versicolor, and 70% of virginica.

sample_size<-data.frame(Species=c("setosa","versicolor","virginica"), prop=c(0.75,0.80,0.70))
iris2 <- merge(iris,sample_size, by="Species",all.x=TRUE)

# created a list
st <- split(iris2, iris2$Species)

set.seed(1234)

# Create the indexes: Sampling by segment using the proportions calculated above
st2 <- lapply(st, function(df) 
df <- sample(nrow(df), nrow(df)*df$prop))

# get the observations
st3 <- lapply(st, function(df,st2) 
df2 <- df[st2,])

I get the correct indexes for sampling:

$setosa
[1]  6 31 30 48 40 29  1 10 28 22 42 41 11 35 38 47 43  9 50  8 34 33  5  2 32 21 13 39 19 44 37 26 23 45  3 12 16
$versicolor
[1] 13 49 39 27 30 15 28 45 22 44 20 10 46  3 12 26 18  6 17 16 23 33 24 41  2  8  1 29 31  7 11 47 40 37 43 19 34 35 21  5
$virginica
[1]  4 16 33 44 22  7 24  9 38 49 13 45 35 39 48  5 42 50 17 10  1 43 21 30 15  8 34 36 25 23  3 29 27 40  2

But when I subset with those indexes, st3 contains the entire population (all 50 rows per species) instead of the samples.

str(st3)
$ setosa    :'data.frame':  50 obs. of  6 variables:
..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
..$ Petal.Length: num [1:50] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
..$ Petal.Width : num [1:50] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
..$ prop        : num [1:50] 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 ...
$ versicolor:'data.frame':  50 obs. of  6 variables:
..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
..$ Sepal.Length: num [1:50] 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
..$ Sepal.Width : num [1:50] 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
..$ Petal.Length: num [1:50] 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
..$ Petal.Width : num [1:50] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
..$ prop        : num [1:50] 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 ...
$ virginica :'data.frame':  50 obs. of  6 variables:
..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
..$ Sepal.Length: num [1:50] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
..$ Sepal.Width : num [1:50] 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
..$ Petal.Length: num [1:50] 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
..$ Petal.Width : num [1:50] 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
..$ prop        : num [1:50] 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 ...

Any help is appreciated! Thanks in advance!

Are the contents of st3 not what you want? - jsta
@jsta, I've just updated the question. I'm getting the entire population instead of the samples. Thanks - Ingrid
I don't understand how your code for st3 returns anything. Your arguments to the anonymous function are backwards (first argument should be what's being lapplyd over) and you assign df2 in the function but don't return it. - mikeck

1 Answer


I split the data frame into a list of data frames, one per species (following https://stackoverflow.com/a/18527515/3362993), then ran dplyr::sample_frac on each list element with its own proportion.

library(dplyr)

data(iris)

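# Named vector of sampling fractions, one per species
props <- c(setosa = 0.75, versicolor = 0.8, virginica = 0.7)
# Split into a separate object so the built-in iris is not overwritten
iris_split <- split(iris, f = iris$Species)

# Look up each group by name so a fraction is always paired with its own species
res <- lapply(names(props), function(sp) sample_frac(iris_split[[sp]], props[[sp]]))
res <- do.call("rbind", res)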

table(res$Species)

setosa versicolor  virginica 
    38         40         35
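
If you would rather stay in base R, here is a minimal sketch of how the question's own approach could be repaired. As far as I can tell, the original st3 call returns every row because lapply only supplies df to the anonymous function; its second argument st2 is left missing, and df[st2, ] with a missing row index selects all rows. The sketch reuses the question's object names (st, st2, st3); using floor() and df$prop[1] to get a single sample size per group is my assumption about the intended rounding, not something stated in the question.

```r
# Rebuild the question's objects (the names st, st2, st3 follow the question)
sample_size <- data.frame(Species = c("setosa", "versicolor", "virginica"),
                          prop = c(0.75, 0.80, 0.70))
iris2 <- merge(iris, sample_size, by = "Species", all.x = TRUE)
st <- split(iris2, iris2$Species)

set.seed(1234)

# One vector of row indexes per species. prop is constant within a group,
# so its first value is enough to compute a single sample size.
# Assumption: round the size down with floor().
st2 <- lapply(st, function(df) sample(nrow(df), floor(nrow(df) * df$prop[1])))

# Map() walks st and st2 in parallel, so each data frame is subset
# by its own index vector instead of by a missing argument.
st3 <- Map(function(df, idx) df[idx, ], st, st2)

# Number of sampled rows per species
sapply(st3, nrow)
```

Map() is used here instead of lapply because two lists have to be iterated together; do.call("rbind", st3) would stack the result back into one data frame if that is what you need.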