I want to draw a sample of 5 random rows, repeat this 1,000 times, and collect the results in one data frame. I am having trouble with replace = FALSE and wonder where, if anywhere, I should set replace = TRUE instead.

I have a dataset of 5,000 rows which looks (simplified) like this:

 Fund.ID Vintage Type Region.Focus Net.Multiple  Size
[1,] 4716  2003  2    US           1.02          Small
[2,] 2237  1998  25   Europe       0.03          Medium
[3,] 1110  1992  2    Europe       1.84          Medium
[4,] 12122 1997  25   Asia         2.04          Large 
[5,] 5721  2006  25   US           0.86          Mega
[6,] 730   1998  2    Europe       0.97          Small

This is my function. It starts with one random row and includes a constraint on the 5 rows being drawn:

       simulate <- function(inv.period) {
          start <- sample_n(dataset, 1, replace=TRUE) #draw random first fund
          t <- start$Vintage:(start$Vintage + inv.period) #define investment period contingent on first fund
          fof <- dataset[sample(which(dataset$Vintage %in% t), 5, replace = FALSE), ] #include constraint, 5 funds in portfolio
        }

#replicate this function 1,000 times 
#and give out as a data frame with portfolios classified
        library(plyr)
        library(dplyr)
        fof.5 <- rdply(1000, simulate(4))
        rename(fof.5, FoF.ID = .n)

If I use replace=FALSE in the simulate function (after fof <-), I get this error:

Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

The whole expression works if I put replace = TRUE. However, this would not be correct, as a row could be drawn twice in the same sample, which I do not want.
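
The error can be reproduced in isolation. It appears whenever fewer than 5 rows fall inside the investment period. A minimal sketch, in which the three-element vector pool is a made-up stand-in for the matching row indices:

```r
# `pool` is a hypothetical stand-in for which(dataset$Vintage %in% t)
pool <- c(2L, 4L, 6L)

# Asking for 5 items out of 3 without replacement fails;
# tryCatch captures the error message instead of stopping the script
res <- tryCatch(sample(pool, 5, replace = FALSE),
                error = function(e) conditionMessage(e))
print(res)

# Capping the sample size at the pool size avoids the error
ok <- sample(pool, min(length(pool), 5), replace = FALSE)
print(ok)
```

Capping the size this way returns a smaller portfolio when too few funds match, rather than allowing a fund to be drawn twice.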

Is there a way to use replace=FALSE when the rows of one sample are drawn, but replace=TRUE across the overall dataset? In other words: a row can be drawn only once within a sample but may be drawn again in another sample.

If you use a function from a package, please indicate it with library(package). It helps others replicate your code and find solutions. - Pierre L
The function simulate does not return any value. It will also fail any time fewer than 5 rows have a Vintage in t. For example, let's say start returns row 4 from its sample. Then start$Vintage will be 1997. Now let's say inv.period is 1, so t is 1997:1998. In your example data only three rows match: rows 2, 4, and 6. Yet you are asking for 5 values to be drawn from them without replacement, which is impossible. - Pierre L
I am using the plyr and dplyr packages, as indicated in the second code box. It is true that my simulate function does not return any value. That is because I store the replicated function output as a data frame with the rdply function. Since I have a large dataset (5,000 rows) with years 1982-2015, the second point you raised should not cause problems. - Toto
fof.5 <- do.call( rbind, replicate(1000, simulate(5), simplify=FALSE ) ) works, but I cannot distinguish draw 1 from draw 2 etc. I want to add an additional column with variable "FoF" which is equal to 1 for the first sample, 2 for the second etc. - Toto
What do you think is the difference between sample_n(dataset, 1, replace=TRUE) and sample_n(dataset, 1, replace=FALSE)? - Pierre L

1 Answer

I would suggest taking out the dplyr stuff; there is no need for it. Secondly, store the matching row indices in a variable called matches, then sample either 5 of them or all of them, whichever is smaller. Note that replace = FALSE only applies within a single call to sample, and each replication makes its own call, so a row cannot repeat within a draw but can still appear again in another draw. Lastly, I would use data.table::rbindlist; its idcol argument creates an index column indicating which draw each row came from. The output will be a data.table; if you are not familiar with it, you can use as.data.frame(rbindlist(....)) at the end to turn it back into a data.frame:

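library(data.table)

simulate <- function(inv.period) {
  # draw one random first fund
  start <- dataset[sample(nrow(dataset), 1), ]
  # investment period contingent on the first fund's vintage
  t <- start$Vintage:(start$Vintage + inv.period)
  # rows whose vintage falls inside the investment period
  matches <- which(dataset$Vintage %in% t)
  # up to 5 distinct rows; indexing via sample.int avoids sample() treating
  # a single match as the range 1:matches
  dataset[matches[sample.int(length(matches), min(length(matches), 5))], ]
}

# 1,000 independent draws; idcol numbers them 1, 2, ...
r <- replicate(1000, simulate(5), simplify = FALSE)
fof.5 <- rbindlist(r, idcol = "draw")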
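
To check that this behaves as intended, here is a self-contained sketch on made-up data. The 200-row toy dataset, the seed, and the names dup.within and dup.across are assumptions for illustration only. simulate is repeated so the block runs on its own, and it indexes matches through sample.int so that a single match is handled safely.

```r
library(data.table)

set.seed(1)  # only to make this illustration reproducible

# Hypothetical toy data standing in for the real 5,000-row dataset
dataset <- data.frame(
  Fund.ID = 1:200,
  Vintage = sample(1982:2015, 200, replace = TRUE)
)

simulate <- function(inv.period) {
  start   <- dataset[sample(nrow(dataset), 1), ]
  t       <- start$Vintage:(start$Vintage + inv.period)
  matches <- which(dataset$Vintage %in% t)
  # index into matches so a length-one vector is not treated as 1:matches
  dataset[matches[sample.int(length(matches), min(length(matches), 5))], ]
}

r     <- replicate(1000, simulate(5), simplify = FALSE)
fof.5 <- rbindlist(r, idcol = "draw")

# Does any fund repeat inside a single draw?
dup.within <- fof.5[, anyDuplicated(Fund.ID), by = draw][, any(V1 > 0)]
# Does any fund appear in more than one place overall?
dup.across <- anyDuplicated(fof.5$Fund.ID) > 0

print(c(within = dup.within, across = dup.across))

# Plain data.frame version, if data.table is unfamiliar
fof.df <- as.data.frame(fof.5)
```

With these toy inputs, within should be FALSE and across TRUE: no fund repeats inside a portfolio, while the same fund can show up in several portfolios.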