1 vote

I have a program where I am running a simulation function for a large number of iterations. I'm stuck, however, on what I expected to be the easiest part: figuring out how to store frequency counts of the function's results.

The simulation function itself is complicated, but it is analogous to R's sample() function. A large amount of data goes in, and the function outputs a vector containing a subset of elements.

x <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

run_simulation <- function(input_data, iterations = 100){
  for (i in 1:iterations){
    result <- sample(input_data, 3, replace=FALSE)
    results <- ????
  }
}

run_simulation(x)

My question is: what is the best (most efficient and most R-like) data structure for storing frequency counts of the function's results inside the simulation loop? As you might be able to tell from the for loop, my background is in languages like Python, where I would create a dict keyed by tuples and increment it every time a particular combination is output:

counts[results_tuple] = counts.get(results_tuple, 0) + 1

However, there is no obvious equivalent dict/hashmap structure in R, and I've often found that trying to emulate other languages in R is a recipe for ugly and inefficient code. (Right now I am converting the output vector to a string and appending it to a result vector that I count later with table(), roughly as sketched below, but that is very memory-inefficient for a high number of iterations of a function that has a limited number of possible output vectors.)
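For reference, a simplified sketch of what I am doing now (the ever-growing results vector is the memory problem):

run_simulation <- function(input_data, iterations = 100){
  results <- character(0)
  for (i in 1:iterations){
    result <- sample(input_data, 3, replace = FALSE)
    # canonicalize element order, then flatten to a single string key
    results <- c(results, paste0(sort(result), collapse = ", "))
  }
  as.data.frame(table(results))
}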

To be clear, here is the kind of output I want:

               Result Freq
   black, pink, green    8
     blue, red, white    7
    black, pink, blue    7
   blue, green, black    5
     blue, green, red    4
   green, blue, white    3
   pink, green, white    3
   white, blue, green    1
   white, orange, red    1
yellow, black, orange    1
  yellow, blue, green    1

I don't care about the frequency of any particular element, only the set. And I don't care about the order of output, just the frequency.

Any advice is appreciated!


3 Answers

1 vote

You could also use an environment (which does in fact use a hash table under the hood). This way you do not need to enumerate all possible outcomes of your simulation, since you are only interested in the counts:

runSimulation <- function(input.size = 300L, iterations = 100L) {
  x <- paste0("E", 1L:input.size)
  results <- new.env(hash = TRUE)  # environment used as a hash map
  for (i in 1:iterations) {
    result <- sample(x, 3, replace = FALSE)
    # canonical key: sorted elements joined into a single string
    nam <- paste0(sort(result), collapse = ".")
    if (exists(nam, envir = results, inherits = FALSE)) {
      results[[nam]] <- results[[nam]] + 1
    } else {
      assign(nam, 1, envir = results)
    }
  }
  l <- as.list(results)
  d <- data.frame(tuple = names(l), count = unlist(l))
  rownames(d) <- NULL
  d
}

However, time-wise this is comparable to the table()-based solution; a quick way to check is sketched below.
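A minimal comparison sketch, assuming the table()-based run_simulation from the answer below is also defined (exact timings will vary by machine):

library(microbenchmark)
microbenchmark(
  env   = runSimulation(input.size = 8L, iterations = 10000L),
  table = run_simulation(x, iterations = 10000),
  times = 10
)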

1 vote

The following is a short solution using base R which seems to give fairly quick execution times.

run_simulation <- function(input_data, iterations = 100){
  Results <- replicate(iterations,
                       paste0(sort(sample(input_data, 3, replace = FALSE)), collapse = ", "))
  # table() tallies the string keys; as.data.frame gives the Result/Freq layout
  as.data.frame(table(Results))  # last expression, so the result is returned visibly
}

run_simulation(x) gives

                  Results Freq
 1     black, blue, green    2
 2    black, blue, orange    2
 3      black, blue, pink    6
 4       black, blue, red    6
 5     black, blue, white    2
 6   black, green, orange    3
 7     black, green, pink    1
 8      black, green, red    1

Benchmarking this for 100, 1,000, 10,000, and 100,000 iterations shows that the times increase roughly linearly with the number of iterations, which is what you want. The total time for 100,000 iterations is about 2,200 milliseconds, or 2.2 seconds. Since you describe your simulation as complicated and as using a great deal of data, it may well be that the time spent in the simulation itself significantly exceeds the time spent in this bit of code tabulating the results.

library(microbenchmark)

microbenchmark(run_simulation(x, iterations = 100),
               run_simulation(x, iterations = 1000),
               run_simulation(x, iterations = 10000),
               run_simulation(x, iterations = 100000),
               times = 100)

 Unit: milliseconds
                                   expr         min          lq      median          uq        max neval
    run_simulation(x, iterations = 100)    2.352262    2.447647    2.488282    2.573545   71.96314   100
    run_simulation(x, iterations = 1000)   19.161997   19.751702   20.476572   24.411885   90.42650   100
    run_simulation(x, iterations = 10000)  193.688216  208.453087  217.130138  226.166201  289.13177   100
    run_simulation(x, iterations = 1e+05) 2012.773904 2125.986609 2169.870885 2236.038487 2426.02379   100
1 vote

You can use a data.table (a juiced-up data.frame implementation) keyed on the possible values. data.tables require a specific syntax, but they are very efficient.

Here is how I would go about it. Matching a simulation output back to the table's key requires sorting it, so I save the sorted result in a new variable:

library(data.table)

x <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

run_simulation <- function(input_data, iterations = 100){

  # generate set of all possible outputs
  possible_values <- sort(input_data)  ## needed to match simulations

  # combn() seems to preserve input order
  # have to sort each column from combn() output if this is not guaranteed
  results <- as.data.table(t(combn(possible_values, 3)))
  setnames(results, c("first", "second", "third"))
  results[, count:=0]  ## initiate counts column
  setkey(results, first, second, third)  ## use index columns as table key

  for (i in 1:iterations){
    result <- sample(input_data, 3, replace=FALSE)
    result_sorted <- t(sort(result))  ## t() needed to specify it's a row
    colnames(result_sorted) <- c('first', 'second', 'third')
    result_sorted <- as.data.table(result_sorted)
    results[result_sorted, count:=count + 1]
  }
  return(results)
}

Most of the lines inside the loop are needed to get the sampled vector into the right format for data.table to look up the correct row. This may be overkill for a small number of possible combinations, but it should pay dividends if the set of possible outputs is larger.
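As a usage sketch (a hypothetical call; the chained data.table syntax drops combinations that were never sampled and orders by frequency):

res <- run_simulation(x, iterations = 1000)
res[count > 0][order(-count)]  # keep observed combinations, most frequent first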