Sampling without replacement from multiple vectors of different length using vector lengths as some sort of weight

Question

I want to take random samples from multiple vectors of different length using vector lengths as some sort of weight, such that more samples are drawn from vectors of larger sizes when compared to smaller ones (proportional sampling of sorts).

To illustrate my point please consider this:

# Generating 100 different individuals
vec1 <- rep( letters , length.out = 100 )
vec2 <- c(1:100)

# Join two above vectors
students <- paste( vec1 , vec2 , sep="" )

The above produces a giant vector of 100 students. Now I am trying to generate 10 random vectors from which the final sampling has to take place.

# Creating 10 vectors of different sizes
a <- split( students , sample(10, 100 , replace = TRUE) )
vec1 <- a$`1`
vec2 <- a$`2`
vec3 <- a$`3`
vec4 <- a$`4`
vec5 <- a$`5`
vec6 <- a$`6`
vec7 <- a$`7`
vec8 <- a$`8`
vec9 <- a$`9`
vec10 <- a$`10`

So now I have 10 vectors (vec1...vec10) of varying sizes. My goal is to get a final vector with a total of 50 random samples drawn from all of them, where the number of samples taken from each vector is proportional to its length.

Is something like this possible?

Apologies if this has been asked before!

mickey mickey · Accepted Answer · 2018-11-14T23:28:29

This will get you approximately 50 students (the exact number depends on how a was split):

new = unlist(lapply(a, function(x) sample(x, round(length(x)/2))))

To get exactly 50 each time, you can do this

ll = sapply(a, length)   # Get length of each vector in "a"
target = 50
new_ll = 0
while (sum(new_ll) != target)
    new_ll = round(ll * target / sum(ll) + runif(length(ll), -0.5, 0.5))

new = unlist(lapply(seq_along(a), function(i) sample(a[[i]], new_ll[i])))

Explanation: Get the length of each vector in a and assign it to ll. This amounts to doing ll[1] = length(vec1); ll[2] = length(vec2) and so on. We need to sample a certain number of elements from each vector in a such that we get 50 elements in total (target). This number is stored in new_ll. It is approximately equal to each vector's length times target / sum(ll), where sum(ll) is the total number of students.
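To make the quota arithmetic concrete, here is a small sketch. The group sizes in ll are made up for illustration (they are not the sizes produced by the split in the question); they were chosen so that plain rounding misses the target.

```r
# Hypothetical group sizes standing in for sapply(a, length)
ll <- c(3, 3, 3, 3, 3, 3, 3, 3, 3, 73)
target <- 50

# Unrounded proportional share of the target for each vector
quota <- ll * target / sum(ll)
print(quota)

# Before rounding, the shares add up to the target exactly
print(sum(quota))

# After rounding they no longer do: the nine shares of 1.5 each round up to 2,
# while 36.5 rounds down to 36 (R rounds halves to the even number)
print(sum(round(quota)))
```

With these sizes the rounded shares add up to 54 rather than 50, which is the gap that any exact-count scheme has to close.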

Since rounding alone does not guarantee that exactly target students are selected, we add a little jitter with runif to move the numbers around slightly, and we keep looping until the sum of new_ll is equal to target.

The final line then iterates i from 1 through 10 (or the number of vectors in a) and samples new_ll[i] from each vector a[[i]].
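If you would rather avoid the retry loop, one possible alternative (not the method used in this answer) is a largest-remainder allocation: take the floor of each proportional share, then hand the leftover units to the vectors with the largest fractional parts. The sketch below rebuilds the question's setup; set.seed and the names quota, alloc, leftover and bump are my own additions for illustration.

```r
set.seed(1)  # assumption: only here so the sketch is reproducible

# Rebuild the setup from the question
students <- paste0(rep(letters, length.out = 100), 1:100)
a <- split(students, sample(10, 100, replace = TRUE))

ll <- lengths(a)   # length of each vector in a
target <- 50

# Largest-remainder allocation: floor each proportional share, then give
# the remaining units to the vectors with the largest fractional parts
quota <- ll * target / sum(ll)
alloc <- floor(quota)
leftover <- target - sum(alloc)
if (leftover > 0) {
  bump <- order(quota - alloc, decreasing = TRUE)[seq_len(leftover)]
  alloc[bump] <- alloc[bump] + 1
}

# Sample without replacement from each vector. Indexing with sample.int
# sidesteps sample()'s special handling of a single numeric value.
new <- unlist(lapply(seq_along(a), function(i) {
  a[[i]][sample.int(ll[i], alloc[i])]
}))

print(length(new))
```

This always returns exactly target elements in one pass, at the cost of making the per-vector counts deterministic for a given split; only which students are picked within each vector remains random.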
