Simulation in R with nested loops runs slowly

Question

I am using R for an agent-based historical simulation. The code works, but slowly. It loops through timesteps (generations), updating a data frame of agent attributes and a second structure that summarises the overall state after each timestep. Above that loop, each parameter setting is run several times. The population starts at 100 agents, but under extreme settings (high S, low A) it can grow above a thousand after, for example, five generations.

I read that updating a matrix is faster than updating a data frame, so I converted the summary to a matrix. I have also heard that vectorisation is best, so before I convert the agents to a matrix as well, could anyone suggest a way to make the code more vectorised? Here is the code:

NextGeneration <- function(agent, N, S, A) {
   # N is number of agents.
   # S is probability that an agent with traditional fertility will have 2 sons surviving to the age of inheritance.
   # A is probability that an heir experiencing division of estate changes his fertility preference from traditional to planned.
   # find number of surviving heirs for each agent
   excess <- runif(N)  # get random numbers 
   heir <- rep(1, N)  # everyone has at least 1 surviving heir 

   # if agent has traditional fertility 2 heirs may survive to inherit
   heir[agent$fertility == "Trad" & excess < S] <- 2  

   # next generation more numerous if spare heirs survive

   # new agents have vertical inheritance but also guided variation. 
   # first append to build a vector, then combine into new agent dataframe  
   nextgen.fertility <- NULL
   nextgen.lineage <- NULL

   for (i in 1:N) {

      if (heir[i]==2) {

         # two agents inherit from one parent.
         for (j in 1:2) {

            # A is probability of inheritance division event affecting fertility preference in new generation.
            if (A > runif(1)) {
               nextgen.fertility <- c(nextgen.fertility, "Plan") 
            } else {
               nextgen.fertility <- c(nextgen.fertility, agent$fertility[i])
            }
            nextgen.lineage <- c(nextgen.lineage, agent$lineage[i])
         }
      } else {
         nextgen.fertility <- c(nextgen.fertility, agent$fertility[i])
         nextgen.lineage <- c(nextgen.lineage, agent$lineage[i])
      }
   }
   # assemble new agent frame  
   nextgen.agent <- data.frame(nextgen.fertility, nextgen.lineage, stringsAsFactors = FALSE) 
   names(nextgen.agent) <- c("fertility", "lineage")
   nextgen.agent
}

So the agents begin like this (Trad = traditional):

ID      fertility   lineage
1       Trad        1
2       Trad        2
3       Trad        3
4       Trad        4
5       Trad        5

and after a few timesteps (generations) of random changes end up something like this:

ID      fertility   lineage
1       Plan       1
2       Plan       1
3       Trad       2
4       Plan       3
5       Trad       3
6       Trad       4
7       Plan       4
8       Plan       4
9       Plan       4
10      Plan       5
11      Trad       5

I tried to run your example but the variables "nextgen.fert" and "nextgen.line" are not defined — Esteban PS

RolandASc · Accepted Answer · 2018-02-04T20:51:43

Indeed, it would be more efficient to encode fertility as 0 and 1, which would also let you store the agents in an integer matrix.

That said, the code as it stands can be simplified a lot. Here is a vectorized solution that still uses your data.frame:

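NextGen <- function(agent, N, S, A) {
  excess <- runif(N)
  # agents with traditional fertility whose second heir survives
  v1 <- which(agent$fertility == "Trad" & excess < S)
  # one heir per agent, then one appended row per agent in v1 (second heirs go at the end)
  nextgen.agent <- agent[c(seq_len(N), v1), ]
  # both heirs of a divided estate switch to "Plan" with probability A;
  # N + seq_along(v1) stays empty when v1 is empty, unlike seq.int(N+1, N)
  divided <- c(v1, N + seq_along(v1))
  nextgen.agent$fertility[divided] <- ifelse(A > runif(length(divided)), "Plan", "Trad")
  nextgen.agent
}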

Testing with a sample agent DF as follows:

agentDF <- data.frame(fertility = "Trad", lineage = 1:50, stringsAsFactors = FALSE)

# use microbenchmark library to compare performance
microbenchmark::microbenchmark(
  base = {
    res1 <- NextGeneration(agentDF, 50, 0.8, 0.8) # note I fixed the two variable typos in your function
  }, 
  new = {
    res2 <- NextGen(agentDF, 50, 0.8, 0.8)
  }, 
  times = 100
)

## Unit: microseconds
## expr      min        lq     mean    median       uq       max neval
## base 1998.533 2163.8605 2446.561 2222.8200 2286.844 14413.173   100
##  new  282.032  304.1165  329.552  320.3255  348.488   467.217   100
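
The remark at the top about encoding fertility as 0 and 1 could look roughly like the sketch below. It is only an illustration under assumptions that are not in the question: the name NextGenInt, the coding 0L = "Trad" and 1L = "Plan", the summary columns, and the parameter values in the driver loop are all made up for the example. N is taken from nrow(agent) instead of being passed in, and the summary matrix is allocated once before the loop.

```r
# Hypothetical integer encoding (an assumption, not from the question):
# column "fertility": 0L = "Trad", 1L = "Plan"; column "lineage": founder ID.
NextGenInt <- function(agent, S, A) {
  N <- nrow(agent)
  # traditional agents whose second heir survives (probability S)
  two <- which(agent[, "fertility"] == 0L & runif(N) < S)
  # one heir per agent, plus a second row for each agent in `two`
  nextgen <- agent[c(seq_len(N), two), , drop = FALSE]
  # both heirs of a divided estate switch to planned with probability A
  divided <- c(two, N + seq_along(two))
  nextgen[divided, "fertility"] <- as.integer(runif(length(divided)) < A)
  nextgen
}

# Hypothetical driver: 5 generations with made-up S and A values
set.seed(1)
agent <- cbind(fertility = rep(0L, 100), lineage = 1:100)
n.gen <- 5
# preallocate the per-generation summary instead of growing it
summary.mat <- matrix(NA_integer_, nrow = n.gen, ncol = 2,
                      dimnames = list(NULL, c("population", "planned")))
for (g in seq_len(n.gen)) {
  agent <- NextGenInt(agent, S = 0.8, A = 0.1)
  summary.mat[g, ] <- c(nrow(agent), sum(agent[, "fertility"]))
}
summary.mat
```

As in the data.frame version, second heirs are appended at the end of the matrix, so siblings are not adjacent. If row order matters, sorting by the lineage column afterwards would be one option.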
