Best approach for masking and predicting rasters (stratified model): maximise performance

Question

This is a question on performance: The code below works well with the dummy data in the example, but scaling the problem to bigger data became very slow. I'm wondering if there is a better approach to do the same things.

The problem

I want to spatially predict a dependent variable y (forest growing stock volume in my real case study) based on a predictor x (forest age in my case study). I also have a forest types map (18 forest types) and I've built a regression model (SVM with radial kernel) for each forest type. For each forest type, I want to predict y based on x. I have annual data for the predictor x (forest age), so these operations must be done for every year (6 years in the example).

Scalability

My problem is the computational time. My real rasters are quite big, and the analysis has to be repeated for 15 years. Below an example of my real rasters

#forest types map (this is a spatRast object created with terra::separate)
Class: RasterStack 
dimensions : 52175, 43480, 2268569000, 18  (nrow, ncol, ncell, nlayers)
resolution : 23, 23  (x, y)
extent     : 313260, 1313300, 4020285, 5220310  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs 
names      : X224, X324, X3111, X3112, X3113, X3114, X3115, X3116, X3117, X3121, X3122, X3123, X3124, X3125, X3131, ... 
min values :    1,    1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1, ... 
max values :    1,    1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1, ...


#example of one predictor (this is a RasterLayer object, read in R with raster::raster)
class      : RasterLayer 
dimensions : 52175, 43480, 2268569000  (nrow, ncol, ncell)
resolution : 23, 23  (x, y)
extent     : 313260, 1313300, 4020285, 5220310  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs 
source     : E:/Elia_backup/FOR-EST/prova/disturbi_anni/2020.tif 
names      : X2020 
values     : 1, 35  (min, max)

Here is my current approach (adapted to the example data created ad hoc):

Create example data

library(raster)
library(terra)
library(ggplot2)
library(caret)
library(snowfall)

# generate data --------------------------------------------------------

set.seed(5)
#stack of predictor for 6 consecutive years
r1 <-r2 <- r3 <- r4 <- r5 <- r6 <-  m <- raster()
r1[] <-sample(1:16,ncell(r1),replace = T) 
r2[] <- sample(1:20,ncell(r1),replace = T)
r3[] <-sample(1:24,ncell(r1),replace = T)
r4[] <-sample(1:28,ncell(r1),replace = T)
r5[] <-sample(1:32,ncell(r1),replace = T)
r6[] <- sample(1:35,ncell(r1),replace = T)
#this stack contain the predictor for each year of the analysis
year_cut <- stack(r1,r2,r3,r4,r5,r6)

#generate the raster mask (simulate land uses)
m[] <- rep(1:10,each=ncell(r1)/10)
#plot
plot(m)
#transform the land uses map in RasterStack with one layer for each land use
m <- separate(rast(m),other=NA)
m <- stack(m)
#rename layers
names(m) <- paste0("type",gsub("X","",names(m)))

#generate data frame of field observation
a <- sample(1:35,500,replace = T) 
g <- sample(seq(10,100,10),500,replace = T)
v <- ((a*runif(500)*g)^2+runif(500))/10000
s <- sample(names(m),500,replace = T)
#predictive models will be built form this data frame
df <- data.frame(a,v,s)
#plot
ggplot(df,aes(x=a,y=v,col=s))+
  geom_point()+
  facet_wrap(~s,scales = "free")

Below I fit one model for each forest type

# models ------------------------------------------------------------------

#function for model fitting based on the land uses
mod_fit <- function(s){
  sample <- df[df$s==s,]
  fit <- train(v~a,data=sample,method="svmRadial",tuneLength=5)
}
#fit one model for each land use
models <- lapply(names(m),mod_fit)
#rename the list of models to match the name of the land use
names(models) <- names(m)

And finally, mask predictor in turn with the forest types and predict y (forest growing stock volume) with the correspondent model. In this piece of code, I iterate the operations for every year (that is, for every raster predictors; in my example, for every layer of year_cut)

# mask and predict --------------------------------------------------------

#function for mask and predict
mypred <- function(x){
  #land use
  type <- names(x)
  #mask the predictor (cut, defined below) with the land use in "type"
  masked <- raster::mask(cut,x)
  #rename for predict
  names(masked) <- "a"
  #predict v~a based on the masked predictor, with the right model
  predicted <- raster::predict(masked,model=models[[which(names(models)==type)]],overwite=T)
}


# loop across "years_cut" ---------------------------------------------------

#here I store my results
tot <- as.list(1:nlayers(year_cut))


for (i in 1:nlayers(year_cut)) {
  #read the predictor for the current year
  cut <- year_cut[[i]]
  #unstack the land use mask
  l.list <- unstack(m)
  names(l.list) <- names(m)
  #set a parallel cluster with as many core as layer in the land use mask
  sfInit(parallel = TRUE, cpus =nlayers(m))
  # Load the required packages inside the cluster
  sfLibrary(raster)
  #export all variable in all the cluster
  sfExportAll()
  # Run parallelized function and stop cluster
  system.time(pred <- sfLapply(l.list, fun =mypred))
  sfStop()  
  #finally, summarize the prediction of each year in one single raster
  pred_stack <- stack(pred)
  #calc in parallel to make the sum of each layer in "pred_stack"
  beginCluster(10)
  tot[[i]] <-clusterR(pred_stack, calc, args = list(fun = sum, na.rm = T), m = 1)
  endCluster()
  
}#140 sec

This took 140 seconds on my PC, but on the real data, it took over 2 days to run. I've tried to substitute the raster functions with the terra equivalent but, strangely, with terra I encounter memory issues, especially in the mask and predict part.

I know that the predict part is a typical Map problem. I'm wondering if I can substitute sfLapply with a parallel Map equivalent, like future_map or parallelMap. Or maybe a more efficient sfLapply call. I'm open to all possibilities.

you could probably simplify your question a lot and attract more interest from the right audience. it seems to be "how to speed up predictions for a svmRadial model" and not at all about raster data (although I would of course be interested to know more about issues you encountered with terra::mask --- perhaps file an issue on github?). — Robert Hijmans
surely predict is part of the bottleneck, but the problem here is bigger and closely related to the size of my data. the single prediction is not a problem, but the whole cycle instead it is very long. I know that predict can be speeded up with clusterR for example, but is it advisable to use beginCluster within sfLapply or in general within a parallel version of the apply function? I just wanted to know if there are better ways to mask and predict large raster data in parallel. as soon as I have time I open an issue on Github for my problem with terra — Elia
My understanding or your question is that is about speeding up model::predict (that you can speed up and then supply to raster::predict) but yes there are other ways. A minimal example would be helpful. — Robert Hijmans
My question is how to change the function mypred in the example to speed up the loop or how to change the loop itself to go faster. Since my function mask and applies a predictive model, it seemed appropriate to specify it in the title. However, the example in the question should be reproducible. — Elia
Those are two questions, I have tried to answer the first one, although not for your specific model — Robert Hijmans

Robert Hijmans Robert Hijmans · Accepted Answer · 2021-04-25T19:07:46

Example data from ?terra::predict

library(terra)
logo <- rast(system.file("ex/logo.tif", package="terra"))   
names(logo) <- c("red", "green", "blue")
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85, 66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31, 22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9, 99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21, 37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
xy <- rbind(cbind(1, p), cbind(0, a))
e <- extract(logo, xy[,2:3])
v <- data.frame(cbind(pa=xy[,1], e[,-1]))

build a model

library(randomForest)
model <- randomForest(formula=pa~., data=v)

Here are four ways to use the model to make a spatial prediction

Standard (not parallel)

r1 <- predict(logo, model)

Use build-in parallelization

r2 <- predict(logo, model, cores=2)

Write a parallel predict function

rfun <- function(mod, dat, ...) {
    ncls <- length(cls)
    nr <- nrow(dat)
    s <- split(dat, rep(1:ncls, each=ceiling(nr/ncls), length.out=nr))
    unlist(  parallel::clusterApply(cls, s, function(x, ...) predict(mod, x, ...))  )
}

library(parallel)
cls <- parallel::makeCluster(2)
parallel::clusterExport(cls, c("model", "rfun", "randomForest"))
r3 <- predict(logo, model, fun=rfun)
parallel::stopCluster(cls)

With the raster package you can also use clusterR

library(raster)
beginCluster(2)
b <- brick(logo)
r4 <- clusterR(b, predict, args = list(x=b, model=model))
endCluster()

There is no point in timing these approaches for this toy example. But terra::predict and raster::predict are not very costly, they just read and write chunks of data.

So to speed things up, I would focus on ways to make model::predict (replace with your model) go faster. One way is to write a parallel version and provide that as the fun argument to raster::predict or terra::predict. There can also be model specific things you can do. For example, if you can make a RandomForest model with 200 trees of the same quality as a model with 500 trees, then use the former to predict faster. You can test their performance by letting them make a million predictions or so; that is, independently of the raster data handling aspect.

If you have access to a HPC you can let each node do one of your forest classes.

Best approach for masking and predicting rasters (stratified model): maximise performance

1 Answers