
The generic version of what I am trying to do is to conduct a simulation study where I manipulate a few variables to see how that impacts a result. I'm having some speed issues with R. The latest simulation worked with a few iterations (10 per experiment). However, when I moved to the large-scale version (10k per experiment), the simulation has now been running for 14 hours (and is still running).

Below is the code (with comments) that I am running. Being a rookie with R, I am struggling to optimize the simulation to be efficient. My hope is to learn from the comments and suggestions provided here and to apply them to future simulation studies.

Let me say a few things about what this code is supposed to do. I am manipulating two variables: effect size and sample size. Each combination is run 10k times (i.e., 10k experiments per condition). I initialize a data frame (called Results) to store my results. I loop over three variables: effect size, sample size, and iterations (10k).

Within the loops, I initialize four NULL vectors: p.test, p.rep, d.test, and d.rep. The first two capture the p-value of the initial t-test and the p-value of the replication (run under identical conditions). The last two store the corresponding effect sizes (Cohen's d).

I generate my random data from a standard normal for the control condition (DVcontrol), and I use my effect size as the mean for the experimental condition (DVexperiment). I take the difference between the values and pass the result to R's t-test function (equivalent to a paired-samples t-test). I store the results in a list called Trial and rbind it to the Results data frame. This process is repeated 10k times per condition until completion.

# Set Simulation Parameters
## Effect Sizes (ES equals the mean difference when SD = 1)
effect_size_range <- seq(0, 2, .1) ## ES
## Sample Sizes
sample_size_range <- seq(10, 1000, 10) ## SS
## Iterations for each ES-SS Combination
iter <- 10000

# Initialize the Results Data Frame
Results <- data.frame()

# Set Random Seed
set.seed(12)

# Loop over the Different ESs
for(ES in effect_size_range) {

  # Loop over the Different Sample Sizes
  for(SS in sample_size_range) {

    # Initialize the p-value and Effect-Size Vectors
    p.test <- NULL
    p.rep <- NULL
    d.test <- NULL
    d.rep <- NULL

    # Loop over the iterations
    for(i in 1:iter) {

      # Generate Test Data
      DVcontrol <- rnorm(SS, mean=0, sd=1)
      DVexperiment <- rnorm(SS, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.test[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.test[i] <- mean(DVdiff) / sd(DVdiff)

      # Generate Replication Data
      DVcontrol <- rnorm(SS, mean=0, sd=1)
      DVexperiment <- rnorm(SS, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.rep[i] <- mean(DVdiff) / sd(DVdiff)
    }

    # Results
    Trial <- list(ES=ES, SS=SS,
                  d.test=mean(d.test), d.rep=mean(d.rep),
                  p.test=mean(p.test), p.rep=mean(p.rep),
                  r=cor(p.test, p.rep, method="kendall"),
                  r.log=cor(log2(p.test)*(-1), log2(p.rep)*(-1), method="kendall"))
    Results <- rbind(Results, Trial)
  }
}

Thanks in advance for your comments and suggestions, Josh

This seems like it belongs on Code Review rather than here. – John Coleman
@JohnColeman I think it's on-topic on both sites. It's a specific question ("how can I speed up this code? it takes over a day") as well as a request for a code review. – Zeta
I can remove this post and move it over to Code Review, if necessary. – Josh
@Josh On further thought, I think that Zeta is right that leaving it here is okay. Arguably, if you can't get any output after 14 hours, it really isn't working at all. – John Coleman
In my opinion, this question could be acceptable on Stack Overflow if you could ask a specific question in the title. "Optimize my simulation" is primarily about improving your code, and thus not a specific request. "How can I run a simulation with two variables in parallel?", however, would be one way to ask about a specific programming problem. – 200_success

1 Answer


The general approach to optimization is to run a profiler to determine what portion of the code the interpreter spends the most time in, and then to optimize that portion. Let's say your code resides in a file called test.R. In R, you can profile it by running the following sequence of commands:

Rprof()              ## Start the profiler
source( "test.R" )   ## Run the code
Rprof( NULL )        ## Stop the profiler
summaryRprof()       ## Display the results

(Note that these commands will generate a file Rprof.out in the directory of your R session.)
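If you also want the timings broken down by source line rather than by function, base R's profiler supports line-level profiling (available since R 3.0.0; it requires that source references are kept, which source does by default in interactive sessions). A minimal variant of the same workflow:

Rprof( line.profiling = TRUE )   ## Start the profiler with line-level detail
source( "test.R" )               ## Run the code
Rprof( NULL )                    ## Stop the profiler
summaryRprof( lines = "show" )   ## Display the results by source line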

If we run the profiler on your code (with iter <- 10, rather than iter <- 10000), we get the following profile:

# $by.self
#                         self.time self.pct total.time total.pct
# "rnorm"                      1.56    24.53       1.56     24.53
# "t.test.default"             0.66    10.38       2.74     43.08
# "stopifnot"                  0.32     5.03       0.86     13.52
# "rbind"                      0.32     5.03       0.52      8.18
# "pmatch"                     0.30     4.72       0.34      5.35
# "mean"                       0.26     4.09       0.42      6.60
# "var"                        0.24     3.77       1.38     21.70

From here, we observe that rnorm and t.test are your most expensive operations (which shouldn't really be a surprise, as they sit in your innermost loop).

Once you've figured out where the expensive function calls are, the actual optimization consists of two steps:

  1. Optimize the function, and/or
  2. Optimize the number of times the function is called.

Since t.test and rnorm are built-in R functions, your only option for Step 1 above is to look for alternative packages that may have faster implementations of sampling from the normal distribution and/or running multiple t-tests. Step 2 is really about restructuring your code in a way that does not recompute the same thing multiple times. For example, the following lines of code do not depend on i:

# Generate Test Data
DVcontrol <- rnorm(SS, mean=0, sd=1)
DVexperiment <- rnorm(SS, mean=ES, sd=1)

Does it make sense to move these outside the loop, or do you really need a new sample of your test data for each different value of i?
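If you do need a fresh sample on every iteration, you can still cut the constant overhead that each iteration pays. Much of t.test's cost is argument checking and method dispatch (the stopifnot and pmatch entries in the profile above), and for a one-sample test against zero the statistic is simple enough to compute by hand. Here is a sketch of the inner loop along those lines (variable names carried over from your code; this reproduces t.test(DVdiff, alternative="greater")$p.value exactly, and the replication half is analogous):

# Preallocate result vectors instead of growing NULL vectors element by element
p.test <- numeric(iter)
d.test <- numeric(iter)

for(i in 1:iter) {
  # Difference scores for one simulated experiment
  DVdiff <- rnorm(SS, mean=ES, sd=1) - rnorm(SS, mean=0, sd=1)
  m <- mean(DVdiff)
  s <- sd(DVdiff)
  # One-sample t-test by hand: t = mean / (sd / sqrt(n)); for
  # alternative="greater" the p-value is the upper tail of t with n-1 df
  t.stat <- m / (s / sqrt(SS))
  p.test[i] <- pt(t.stat, df=SS-1, lower.tail=FALSE)
  d.test[i] <- m / s
}

The same reasoning applies to Results: calling rbind(Results, Trial) inside the double loop copies the entire data frame on every pass, so the cost grows quadratically. One way around it is to collect each Trial in a pre-sized list and bind once at the end, e.g. Trials[[k]] <- Trial inside the loop followed by Results <- do.call(rbind, lapply(Trials, as.data.frame)) after both loops. Going further, you could draw all SS * iter values in a single rnorm call and reshape them into a matrix, replacing the inner loop with colMeans, but the two changes above should already remove most of the overhead the profile points to.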