1 vote

I am searching for an R equivalent of Stata's extremely convenient simulate command. The command basically lets you declare a program (reg_simulation in the example below), then invoke that program repeatedly from simulate and store the desired outputs.

Below is a Stata illustration of the usage of the simulate program, together with my attempt to replicate it using R.

Finally, my main question is: is this how R users would run a Monte Carlo simulation, or am I missing something in terms of structure or speed bottlenecks? Thanks a lot in advance.

Stata example

1. Defining the reg_simulation program:
clear all
*Define "reg_simulation" to be used later on by "simulate" command 
program reg_simulation, rclass
    *Declaring Stata version
    version 13
    *Dropping all variables in memory
    drop _all
    *Set sample size (n=100)
    set obs 100
    *Simulate model
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
    *Estimate OLS
    reg y x1 x2 
    *Store coefficients
    matrix B = e(b)
    return matrix betas = B 
end
2. Calling reg_simulation from the simulate command:
*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
3. Obtained result (data stored in memory):
_b_x1   _b_x2   _b_cons
.4470155    1.50748     1.043514
.4235979    1.60144     1.048863
.5006762    1.362679    .8828927
.5319981    1.494726    1.103693
.4926634    1.476443    .8611253
.5920001    1.557737    .8391003
.5893909    1.384571    1.312495
.4721891    1.37305     1.017576
.7109139    1.47294     1.055216
.4197589    1.442816    .9404677

R replication of the Stata program above.

Using R, I have managed to get the following (I'm not an R expert, though). However, the part that worries me most is the for-loop structure that loops over the number of repetitions nreps.

1. Defining the reg_simulation function:
#Defining a function
reg_simulation <- function(obs = 1000){
  data <- data.frame(
    #Generate data
    x1 <- rnorm(obs, 0, 1),
    x2 <- rnorm(obs, 0, 1),
    y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1))
  #Estimate OLS
  ols <- lm(y ~ x1 + x2, data = data)
  return(ols$coefficients)
}
2. Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
  #Set seed internally (to get different values in each run)
  set.seed(i)
  #Save results into list
  results_list[i]  <- list(reg_simulation(obs=1000))  
}
#Unlist results into a data frame
df_results <- data.frame(t(sapply(results_list,
                                  function(x) x[1:max(lengths(results_list))])))
3. Obtained result: df_results.
#final results
df_results
#   X.Intercept.  x1        x2
# 1     1.0162384 0.5490488 1.522017
# 2     1.0663263 0.4989537 1.496758
# 3     0.9862365 0.5144083 1.462388
# 4     1.0137042 0.4767466 1.551139
# 5     0.9996164 0.5020535 1.489724
# 6     1.0351182 0.4372447 1.444495
# 7     0.9975050 0.4809259 1.525741
# 8     1.0286192 0.5253288 1.491966
# 9     1.0107962 0.4659812 1.505793
# 10    0.9765663 0.5317318 1.501162
Instead of the loop, you could simply call replicate(10, reg_simulation()). – coffeinjunky
Though note you are also varying your x. You may want to consider keeping your independent variables fixed and varying only your error terms, i.e. rnorm, but that depends on what you want to achieve. That applies to both Stata and R. – coffeinjunky
"To get different values in each run" you don't need to set.seed in the loop; doing it just once outside of it will do. – Rui Barradas
@RuiBarradas thank you for this, I did not know about it. However, I will probably let the code run for a couple of days, and it is highly likely that it runs into problems for some of the iterations (numerical problems, mainly), and I would like to be able to recover the result for, let's say, iteration 2455, with the whole data simulated using set.seed(2455). Do you think there is a better way to do it? Thank you again. – Álvaro A. Gutiérrez-Vargas

2 Answers

2 votes

You're on the right track. Couple of hints/corrections:

1. Don't use <- inside data.frame()

In R, we construct data frames using = for internal column assignment, e.g. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).

(There's more to be said about <- vs =, but I don't want to distract from your main question.)
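A quick illustration of the practical difference, comparing the column names the two forms produce (note that the <- version also assigns x in the calling environment as a side effect):

names(data.frame(x = 1:2))   ## "x": the tag names the column
names(data.frame(x <- 1:2))  ## a mangled, auto-generated name, e.g. "x....1.2"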

In your case, you don't actually need to create a data frame at all, since x1, x2 and y will all be visible within the scope of the function when you call lm() without a data argument. I'll post some code at the end of my answer demonstrating this.

2. When growing a list via a for loop in R, always try to pre-allocate the list first

Always try to pre-allocate the list's length (and, where possible, type) if you are going to fill it in a (long) for loop. Reason: that way, R knows how much memory to allocate to your object up front, instead of having to re-size it on every iteration. In the case where you are only doing 10 reps, that would mean starting with something like:

results_list <- vector("list", 10)
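The loop then fills the pre-allocated slots by index. A minimal sketch, adapting the loop from your question:

results_list <- vector("list", nreps)
for (i in 1:nreps) {
  #Assign into the i-th pre-allocated slot (note the [[ indexing)
  results_list[[i]] <- reg_simulation(obs = 1000)
}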

3. Consider using lapply instead of for

for loops have a bit of a bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply, which looks very similar to a for loop, as the sketch below shows. Following on from point 2, one immediate benefit is that with lapply you don't need to pre-allocate the list.
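A minimal sketch, reusing the reg_simulation() function from your question (the anonymous function simply discards the iteration index that lapply passes in):

## lapply builds and returns the whole list, so no pre-allocation is needed
results_list <- lapply(1:nreps, function(i) reg_simulation(obs = 1000))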

4. Run large loops in parallel

A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.


Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...

library(data.table)   ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession)    ## use all available cores

obs <- 1e3

# Defining a function (the ... argument simply absorbs the iteration
# index that lapply/future_lapply passes in)
reg_simulation <- function(...){
    x1 <- rnorm(obs, 0 , 1)
    x2 <- rnorm(obs, 0 , 1)
    y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
    #Estimate OLS
    ols <- lm(y ~ x1 + x2)  
    
    # return(ols$coefficients)
    return(as.data.frame(t(ols$coefficients)))
}

# N repetitions
nreps <- 10

## Serial version
# results  <- lapply(1:nreps, reg_simulation)

## Parallel version
results  <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)

## Unlist / convert into a data.table
results <- rbindlist(results)
2 votes

So, following up on the comments, you want to vary your independent variables (x) and also the error term and simulate the coefficients, but you also want to catch errors if any occur. The following would do the trick:

set.seed(42)
#Defining a function
reg_simulation <- function(obs = 1000){
  #Generate data (built as plain vectors first, so the data frame
  #columns get proper names; see the other answer on <- vs =)
  x1 <- rnorm(obs, 0, 1)
  x2 <- rnorm(obs, 0, 1)
  y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1)
  data <- data.frame(x1, x2, y)
  #Estimate OLS, catching any errors
  tryCatch(
    {
      ols <- lm(y ~ x1 + x2, data = data)
      return(ols$coefficients)
    },
    error = function(e){
      return(c('(Intercept)' = NA, 'x1' = NA, 'x2' = NA))
    }
  )
}
output <- t(data.frame(replicate(10, reg_simulation())))
output

    (Intercept)        x1       x2
X1    0.9961328 0.4782010 1.481712
X2    1.0234698 0.4801982 1.556393
X3    1.0336289 0.5239380 1.435468
X4    0.9796523 0.5095907 1.493548
...

Here, tryCatch (see also failwith) catches the error and returns the vector of NAs as the default value.

Note that you only need to set the seed once, because the state of the random number generator advances automatically, and deterministically, with every subsequent call to it.
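A quick sketch to convince yourself of this: a seeded stream yields the same draws whether you take them in one call or across several. (If, as raised in the comments, you want to reproduce a single iteration in isolation, re-seeding with the iteration number inside the loop, as in the question's original code, trades this property for per-iteration reproducibility.)

set.seed(42)
rnorm(2)               #Two draws from the seeded stream
set.seed(42)
c(rnorm(1), rnorm(1))  #The same two draws, taken one call at a time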