I am in the process of optimizing my code, and I am running into some problems. I know that the greatest speed-ups in R come from vectorizing code instead of using loops. However, I have my data in lists, and I am not sure if I can vectorize my code or not. I have tried using the apply functions (like lapply, vapply), but I read that these functions are just for writing cleaner code and are actually using loops under the hood!
Here are my three biggest bottlenecks in my code, though I do not think anything can be done for the first part.
1) Reading data
I work with batches of 1000 matrices of dimensions 277x349. This is the biggest bottleneck in my script, but I alleviated the problem a little bit by using the doMC package to take advantage of multiple cores with the foreach function. This results in a list containing 1000 277x349 matrices.
For the purposes of the question, say we have a list of 1000 matrices of dimensions 277x349:
# Fake data
l <- list()
for(i in 1:1000) {
  l[[i]] <- matrix(rnorm(277*349), nrow=277, ncol=349)
}
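(One small thing I plan to try regardless is pre-allocating the list rather than growing it inside the loop:)
l <- vector("list", 1000)   # pre-allocate, then fill l[[i]] inside the loop as above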
2) Bottleneck #1
I need to make comparisons to some reference matrix (of the same dimensions). This means comparing each of the 1000 matrices in my list to my reference matrix to get a vector of 1000 distances. If I know that the matrices all have the same dimensions, can I vectorize this step?
Here is some code:
# The reference matrix
r <- matrix(rnorm(277*349), nrow=277, ncol=349)
# The number of non-NA values in each matrix. Do not need to worry about this...
K <- 277*349
# Make a function to calculate distances
distance <- function(xi, xj, K, na.rm=TRUE) {
  sqrt(sum((xi - xj)^2, na.rm=na.rm)/K)
}
# Get a vector containing all the distances
d <- vapply(l, distance, c(0), xj=r, K=K)
This step is bearably fast using vapply, but it is the third slowest part of the code.
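The closest I have come to "vectorizing" this step (I have not checked whether it is actually faster) is to stack the list into a 3D array first:
# Stack the list into a 277 x 349 x 1000 array, then compute all distances at once
a  <- simplify2array(l)
d2 <- sqrt(colSums((a - c(r))^2, na.rm=TRUE, dims=2) / K)
# all.equal(d, d2)   # should be TRUE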
3) Bottleneck #2
I now want to make a weighted average matrix of the J "closest" matrices to my reference matrix. (There is a sorting step, but assume that d[1] < d[2] < ... < d[1000] for simplicity.) I want to get the weighted average matrix for each J = 1, 2, ..., 1000.
# Get the weighted average matrix of the J closest matrices
weightedMatrix <- function(listOfData, distances, J) {
  # Calculate weights from the J smallest distances (inverse squared, normalized to sum to 1)
  w <- distances[1:J]^(-2) / sum(distances[1:J]^(-2))
  # Get the weighted average matrix
  # *** I use a loop here ***
  x_bar <- matrix(0, nrow=nrow(listOfData[[1]]), ncol=ncol(listOfData[[1]]))
  for(i in 1:J) {
    x_bar <- x_bar + listOfData[[i]] * w[i]
  }
  return(x_bar)
}
# Oh no! Another loop...
res <- list()
for(i in 1:length(l)) {
  res[[i]] <- weightedMatrix(l, d, J=i)
}
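The best partial improvement I can think of (still a loop, but each matrix is touched once in total rather than once per J) is to keep a running, unnormalized weighted sum. This is only a sketch and I have not verified it is optimal:
wts <- d^(-2)                 # unnormalized weights
cw  <- cumsum(wts)            # cumulative weight totals, one per J
running <- matrix(0, nrow=nrow(l[[1]]), ncol=ncol(l[[1]]))
res2 <- vector("list", length(l))
for(J in seq_along(l)) {
  running   <- running + l[[J]] * wts[J]   # add one more matrix each step
  res2[[J]] <- running / cw[J]             # normalize for this value of J
}
# all.equal(res[[500]], res2[[500]])       # should agree up to rounding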
I am a little stumped. I do not see an intuitive way to vectorize operations on a list of matrices.
The script that I am writing will be called fairly often, so even a little improvement can add up!
EDIT:
RE: 1) Reading data
I forgot to mention that my data is in a special format, so I have to use a special data reading function to read the data into R. The files are in netCDF4 format; I am using the nc_open function from the ncdf4 package to access the files, and then the ncvar_get function to read the variable of interest. The nice thing is that the data can stay on disk, and I only read it into memory with ncvar_get when I need to do operations on it in R.
That being said, although I know the size of my matrices and how many of them I will have, I asked my question with a list of data because the foreach function that enables me to do parallel computing outputs the results from the parallelized loop in a list. I found that with the foreach function, the data reading step was about 3x faster.
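For context, the reading step looks roughly like this (heavily simplified; the file names and the variable name "my_variable" below are placeholders, not my real ones):
library(ncdf4)
library(foreach)
library(doMC)
registerDoMC(12)

files <- sprintf("batch_%04d.nc", 1:1000)   # placeholder file names
l <- foreach(f = files) %dopar% {
  nc <- nc_open(f)                          # open a connection to the file
  m  <- ncvar_get(nc, "my_variable")        # read the variable of interest
  nc_close(nc)
  m                                         # foreach collects these into a list
}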
I imagine that I can rearrange the data into a 3D array afterwards, but maybe allocating the 3D array takes more time than it saves? I will have to try it tomorrow.
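For reference, the conversion I have in mind is something like either of the following (untimed so far):
a1 <- simplify2array(l)                          # list of matrices -> 3D array
a2 <- array(unlist(l), dim = c(277, 349, 1000))  # same result via unlist()
# all.equal(a1, a2)                              # should be TRUE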
EDIT 2:
Here are some of the timings I took of my script.
Original Script:
[1] "Reading data to memory"
user system elapsed
176.063 44.070 26.611
[1] "Calculating Distances"
user system elapsed
2.312 0.000 2.308
[1] "Calculating the best 333 weighted matrices"
user system elapsed
63.697 28.495 9.092
I made the following improvements thus far: (1) pre-allocated the list before reading data, and (2) improved the weighted-matrix calculations, as per Martin Morgan's suggestion.
[1] "Reading data to memory"
user system elapsed
192.448 38.578 27.872
[1] "Calculating Distances"
user system elapsed
2.324 0.000 2.326
[1] "Calculating all 1000 weighted matrices"
user system elapsed
1.376 0.000 1.374
Some notes:
I use 12 cores in my foreach loop to read in the data (registerDoMC(12)). The whole script takes approximately 40s / 36s to run before / after the improvements.
The timing for my Bottleneck #2 has improved by quite a bit. Previously, I had been computing only the top third (i.e. 333) of the weighted matrices, but now the script can just calculate all the weighted matrices in a fraction of the original time.
Thanks for the help! I will try tweaking my code later to see if I can change my script to work with 3D arrays instead of lists. I am going to take some time now to verify the calculations just to be sure they work!
Comments:
- sqrt(colMeans((tmp - rr)^2, dims = 2)) doesn't appear much faster than vapply at all. - joran
- ... foreach function, but I will try converting it into a 3D array to see if I can speed things up. If what @joran says is correct, I may not need to? - ialm
- ... .combine argument of foreach. - Roland
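Following up on Roland's comment, a minimal sketch of what using .combine to build a 3D array directly might look like (the loop body here just generates fake data in place of my real reading code):
library(foreach)
a <- foreach(i = 1:1000, .combine = "c") %do% {
  matrix(rnorm(277*349), nrow=277, ncol=349)   # stand-in for the real read
}
dim(a) <- c(277, 349, 1000)   # reshape the concatenated vector into a 3D array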