1 vote

I have a program where I am running a simulation function for a large number of iterations. I'm stuck, however, on what I expected to be the easiest part: figuring out how to store frequency counts of the function's results.

The simulation function itself is complicated, but it is analogous to R's sample() function. A large amount of data goes in, and the function outputs a vector containing a subset of elements.

x <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

run_simulation <- function(input_data, iterations = 100){
  for (i in 1:iterations){
    result <- sample(input_data, 3, replace=FALSE)
    results <- ????
  }
}

run_simulation(x)

My question is: what is the best (most efficient and most R-like) data structure for storing frequency counts of the function's results inside the simulation loop? As you might be able to tell from the for loop, my background is in languages like Python, where I would create a dict keyed by tuples and increment it every time a particular combination is output:

counts[results_tuple] = counts.get(results_tuple, 0) + 1

However, there is no obvious equivalent dict/hashmap structure in R, and I've often found that trying to emulate other languages in R is a recipe for ugly and inefficient code. (Right now I am converting the output vector to a string and appending it to a result vector that I count later with table(), roughly as sketched below, but that is very memory-inefficient for a high number of iterations of a function that has a limited number of possible output vectors.)
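For reference, a simplified sketch of what I am doing now (the ever-growing results vector is the memory problem):

run_simulation <- function(input_data, iterations = 100){
  results <- character(0)
  for (i in 1:iterations){
    result <- sample(input_data, 3, replace = FALSE)
    # canonicalize element order, then flatten to a single string key
    results <- c(results, paste0(sort(result), collapse = ", "))
  }
  as.data.frame(table(results))
}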

To be clear, here is the kind of output I want:

               Result Freq
   black, pink, green    8
     blue, red, white    7
    black, pink, blue    7
   blue, green, black    5
     blue, green, red    4
   green, blue, white    3
   pink, green, white    3
   white, blue, green    1
   white, orange, red    1
yellow, black, orange    1
  yellow, blue, green    1

I don't care about the frequency of any particular element, only the set. And I don't care about the order of output, just the frequency.

Any advice is appreciated!


3 Answers

1 vote

You could also use an environment (which does in fact use a hash table under the hood). This way you do not need to enumerate all possible outcomes of your simulation, since you are only interested in the counts:

runSimulation <- function(input.size = 300L, iterations = 100L) {
  x <- paste0("E", 1L:input.size)
  results <- new.env(hash = TRUE)  # environment used as a hash map
  for (i in 1:iterations) {
    result <- sample(x, 3, replace = FALSE)
    # canonical key: sorted elements joined into a single string
    nam <- paste0(sort(result), collapse = ".")
    if (exists(nam, envir = results, inherits = FALSE)) {
      results[[nam]] <- results[[nam]] + 1
    } else {
      assign(nam, 1, envir = results)
    }
  }
  l <- as.list(results)
  d <- data.frame(tuple = names(l), count = unlist(l))
  rownames(d) <- NULL
  d
}

However, time-wise this is comparable to the table()-based solution; a quick way to check is sketched below.
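A minimal comparison sketch, assuming the table()-based run_simulation from the answer below is also defined (exact timings will vary by machine):

library(microbenchmark)
microbenchmark(
  env   = runSimulation(input.size = 8L, iterations = 10000L),
  table = run_simulation(x, iterations = 10000),
  times = 10
)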

1 vote

The following is a short solution using base R which seems to give fairly quick execution times.

run_simulation <- function(input_data, iterations = 100){
  Results <- replicate(iterations,
                       paste0(sort(sample(input_data, 3, replace = FALSE)), collapse = ", "))
  # table() tallies the string keys; as.data.frame gives the Result/Freq layout
  as.data.frame(table(Results))  # last expression, so the result is returned visibly
}

run_simulation(x) gives

                  Results Freq
 1     black, blue, green    2
 2    black, blue, orange    2
 3      black, blue, pink    6
 4       black, blue, red    6
 5     black, blue, white    2
 6   black, green, orange    3
 7     black, green, pink    1
 8      black, green, red    1

Benchmarking this for 100, 1,000, 10,000, and 100,000 iterations shows that the times increase roughly linearly with the number of iterations, which is what you want. The total time for 100,000 iterations is about 2,200 milliseconds, or 2.2 seconds. Since you describe your simulation as complicated and as using a great deal of data, it may well be that the time spent in the simulation itself significantly exceeds the time spent in this bit of code tabulating the results.

library(microbenchmark)

microbenchmark(run_simulation(x, iterations = 100),
               run_simulation(x, iterations = 1000),
               run_simulation(x, iterations = 10000),
               run_simulation(x, iterations = 100000),
               times = 100)

 Unit: milliseconds
                                   expr         min          lq      median          uq        max neval
    run_simulation(x, iterations = 100)    2.352262    2.447647    2.488282    2.573545   71.96314   100
    run_simulation(x, iterations = 1000)   19.161997   19.751702   20.476572   24.411885   90.42650   100
    run_simulation(x, iterations = 10000)  193.688216  208.453087  217.130138  226.166201  289.13177   100
    run_simulation(x, iterations = 1e+05) 2012.773904 2125.986609 2169.870885 2236.038487 2426.02379   100
1 vote

You can use a data.table (a juiced-up data.frame implementation) keyed on the possible values. data.tables require a specific syntax, but they are very efficient.

Here is how I would go about it. Matching a simulation output back to the table's key requires sorting it, so I save the sorted result in a new variable:

library(data.table)

x <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

run_simulation <- function(input_data, iterations = 100){

  # generate set of all possible outputs
  possible_values <- sort(input_data)  ## needed to match simulations

  # combn() seems to preserve input order
  # have to sort each column from combn() output if this is not guaranteed
  results <- as.data.table(t(combn(possible_values, 3)))
  setnames(results, c("first", "second", "third"))
  results[, count:=0]  ## initiate counts column
  setkey(results, first, second, third)  ## use index columns as table key

  for (i in 1:iterations){
    result <- sample(input_data, 3, replace=FALSE)
    result_sorted <- t(sort(result))  ## t() needed to specify it's a row
    colnames(result_sorted) <- c('first', 'second', 'third')
    result_sorted <- as.data.table(result_sorted)
    results[result_sorted, count:=count + 1]
  }
  return(results)
}

Most of the lines inside the loop are needed to get the sampled vector into the right format for data.table to look up the correct row. This may be overkill for a small number of possible combinations, but it should pay dividends if the set of possible outputs is larger.
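As a usage sketch (a hypothetical call; the chained data.table syntax drops combinations that were never sampled and orders by frequency):

res <- run_simulation(x, iterations = 1000)
res[count > 0][order(-count)]  # keep observed combinations, most frequent first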