1
votes
data1=data.frame("School"=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5),
"Group"=c(1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2),
"Class"=c('A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C'),
"Size"=c(459,441,410,201,327,156,129,427,249,331,477,458,288,472,275,449,424,469,386,387,103,320,284,277,481,167,348,247,115,193))

data2=data.frame("ID"=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30),
"Group"=c(2,2,2,2,1,1,1,2,2,2,2,2,2,1,1,2,1,2,1,1,2,2,2,2,2,1,1,1,1,1),
"Class"=c('A','B','C','B','C','C','A','A','A','B','B','A','B','A','C','C','B','A','B','A','B','C','B','A','C','B','B','C','C','B'),
"Funds"=c(5,8,9,4,6,3,5,7,6,7,6,7,6,7,6,7,4,9,5,7,5,5,5,7,9,7,6,9,4,7),
"Ratio"=c(2,2,3,1,1,2,3,3,1,3,3,3,3,1,1,3,2,1,1,1,3,1,3,3,1,2,1,3,1,3),
"WEIGHT"=c(162,65,104,118,105,49,107,93,25,24,103,115,64,186,57,123,86,181,70,111,154,135,37,73,127,148,188,169,73,52))

I have data1 and data2 and hope to provide a simple example.

In data1 you see the 'School' number and the 'Size' of the 'Group' X 'Class' combinations. In data2 you have information on 'Funds' and 'Ratio' for each of the 'Group' X 'Class' combinations. In data2 you also have 'WEIGHT' which equals to the sampling frame, which is to say that different 'ID's have different "weight" or importance.

I wish to create data3 which would look like this:

enter image description here

Overall I wish for data3 to have sum(data1$Size) rows. I wish to copy 'Size' rows for each 'School' X 'Group' X 'Class' combination from data1. Then I wish to sample with replacement by 'Group' X 'Class' 'Funds' and 'Ratio' from data2, using WEIGHTS to inform the probability of picking each row, to fill in data3.

1
@joran ........ - bvowe
If you sum the 'Size' column from data1 that will give you the expected number of rows in the output. - bvowe
Group and Class are the interrsecting column names in both datasets, but they are also duplicated. So, it is not clear how you want to merge - akrun
@akrun thanks so much. I think I have a challenge in explaining this but I will continue trying. Step 1 would be to make 'data3' by take 'data1' and expand it by 'Size'. So for example the first row is repeated 459 times, the second row is repeated 441 times, and so on and so forth. Then add empty columns 'Funds' and 'Ratio' to 'data3'. After this is complete, sample from 'Funds' and 'Ratio' from 'data2' by 'Group' and 'Class' using WEIGHTS to fill in values for 'Funds' and 'Ratio' in 'data3' - bvowe
I expanded the data1 based on the 'Size' column, then split both datasets by 'Group', 'Class' and use Map to create the columns in first dataset based on a probablity created with 'WEIGHTS/sum(WEIGHTS) - akrun

1 Answers

1
votes

Here is an option with data.table where we expand the data1 based on the 'Size' column, then split both datasets by 'Group', 'Class' and use Map to create the columns in first dataset based on a probablity created with WEIGHT/sum(WEIGHT) to be used in the sample for sampling the 'Funds', 'Ratio' column values, assign (:=) it to create new columns in the list and finally rbind the list elements withrbindlist`

library(data.table)
dt1 <- setDT(data1)[rep(seq_len(nrow(data1)), Size)]
lst1 <- split(dt1, dt1[, .(Group, Class)])
lst2 <- split(data2, data2[c('Group', 'Class')], drop = TRUE)

out <- rbindlist(Map(function(x, y) {
            prb = y$WEIGHT/sum(y$WEIGHT)
            x[, c('Funds', 'Ratio') := 
           .(sample(y$Funds,  size = .N, replace = TRUE, prob=prb), 
            sample(y$Ratio,  size = .N, replace = TRUE, prob=prb) )]}, 
            lst1, lst2[names(lst1)]))

out[, Size := NULL]
sum(data1$Size)
#[1] 9750
nrow(out)
#[1] 9750