This is a question on performance: the code below works well with the dummy data in the example, but it becomes very slow when scaled to bigger data. I'm wondering if there is a better approach to do the same thing.
The problem
I want to spatially predict a dependent variable y (forest growing stock volume in my real case study) based on a predictor x (forest age in my case study). I also have a forest types map (18 forest types) and I've built a regression model (SVM with radial kernel) for each forest type. For each forest type, I want to predict y based on x. I have annual data for the predictor x (forest age), so these operations must be done for every year (6 years in the example).
Scalability
My problem is the computational time. My real rasters are quite big, and the analysis has to be repeated for 15 years. Below is an example of my real rasters:
#forest types map (created with terra::separate, then converted to a RasterStack)
class      : RasterStack
dimensions : 52175, 43480, 2268569000, 18 (nrow, ncol, ncell, nlayers)
resolution : 23, 23 (x, y)
extent : 313260, 1313300, 4020285, 5220310 (xmin, xmax, ymin, ymax)
crs : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs
names : X224, X324, X3111, X3112, X3113, X3114, X3115, X3116, X3117, X3121, X3122, X3123, X3124, X3125, X3131, ...
min values : 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
max values : 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
#example of one predictor (this is a RasterLayer object, read into R with raster::raster)
class : RasterLayer
dimensions : 52175, 43480, 2268569000 (nrow, ncol, ncell)
resolution : 23, 23 (x, y)
extent : 313260, 1313300, 4020285, 5220310 (xmin, xmax, ymin, ymax)
crs : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs
source : E:/Elia_backup/FOR-EST/prova/disturbi_anni/2020.tif
names : X2020
values : 1, 35 (min, max)
Here is my current approach (adapted to the example data created ad hoc):
Create example data
library(raster)
library(terra)
library(ggplot2)
library(caret)
library(snowfall)
# generate data --------------------------------------------------------
set.seed(5)
#stack of predictors for 6 consecutive years
r1 <- r2 <- r3 <- r4 <- r5 <- r6 <- m <- raster()
r1[] <- sample(1:16, ncell(r1), replace = TRUE)
r2[] <- sample(1:20, ncell(r1), replace = TRUE)
r3[] <- sample(1:24, ncell(r1), replace = TRUE)
r4[] <- sample(1:28, ncell(r1), replace = TRUE)
r5[] <- sample(1:32, ncell(r1), replace = TRUE)
r6[] <- sample(1:35, ncell(r1), replace = TRUE)
#this stack contains the predictor for each year of the analysis
year_cut <- stack(r1, r2, r3, r4, r5, r6)
#generate the raster mask (simulating land uses)
m[] <- rep(1:10, each = ncell(r1)/10)
#plot
plot(m)
#transform the land-use map into a RasterStack with one layer for each land use
m <- separate(rast(m), other = NA)
m <- stack(m)
#rename layers
names(m) <- paste0("type", gsub("X", "", names(m)))
#generate a data frame of field observations
a <- sample(1:35, 500, replace = TRUE)
g <- sample(seq(10, 100, 10), 500, replace = TRUE)
v <- ((a*runif(500)*g)^2 + runif(500))/10000
s <- sample(names(m), 500, replace = TRUE)
#predictive models will be built from this data frame
df <- data.frame(a, v, s)
#plot
ggplot(df, aes(x = a, y = v, col = s)) +
  geom_point() +
  facet_wrap(~s, scales = "free")
Below I fit one model for each forest type:
# models ------------------------------------------------------------------
#function for model fitting based on the land uses
mod_fit <- function(s){
  sample <- df[df$s == s, ]
  train(v ~ a, data = sample, method = "svmRadial", tuneLength = 5)
}
#fit one model for each land use
models <- lapply(names(m),mod_fit)
#rename the list of models to match the name of the land use
names(models) <- names(m)
And finally, I mask the predictor in turn with each forest type and predict y (forest growing stock volume) with the corresponding model. In this piece of code, I iterate the operations over every year (that is, over every raster predictor; in my example, over every layer of year_cut).
# mask and predict --------------------------------------------------------
#function to mask and predict
mypred <- function(x){
  #land use
  type <- names(x)
  #mask the predictor ("cut", defined in the loop below) with the land use in "type"
  masked <- raster::mask(cut, x)
  #rename for predict
  names(masked) <- "a"
  #predict v~a from the masked predictor, with the matching model
  raster::predict(masked, model = models[[type]], overwrite = TRUE)
}
# loop across "year_cut" ---------------------------------------------------
#here I store my results
tot <- as.list(1:nlayers(year_cut))
for (i in 1:nlayers(year_cut)) {
  #read the predictor for the current year
  cut <- year_cut[[i]]
  #unstack the land-use mask
  l.list <- unstack(m)
  names(l.list) <- names(m)
  #set up a parallel cluster with as many cores as there are layers in the land-use mask
  sfInit(parallel = TRUE, cpus = nlayers(m))
  #load the required packages inside the cluster
  sfLibrary(raster)
  #export all variables to all the workers
  sfExportAll()
  #run the parallelized function and stop the cluster
  system.time(pred <- sfLapply(l.list, fun = mypred))
  sfStop()
  #finally, summarize the predictions of the year in one single raster
  pred_stack <- stack(pred)
  #parallel calc to sum the layers of "pred_stack"
  beginCluster(10)
  tot[[i]] <- clusterR(pred_stack, calc, args = list(fun = sum, na.rm = TRUE), m = 1)
  endCluster()
} #140 sec
This took 140 seconds on my PC, but on the real data, it took over 2 days to run.
I've tried to substitute the raster functions with the terra equivalents but, strangely, with terra I encounter memory issues, especially in the mask and predict part.
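For reference, the terra substitution I tried is roughly the sketch below (cut_t stands for the SpatRaster version of the year's predictor and x for one land-use layer as a SpatRaster; those names are mine, not objects defined above):

#terra version of mypred (sketch, not my exact code)
mypred_terra <- function(x){
  type <- names(x)
  #terra::mask instead of raster::mask
  masked <- terra::mask(cut_t, x)
  names(masked) <- "a"
  #terra::predict instead of raster::predict
  terra::predict(masked, model = models[[type]], na.rm = TRUE)
}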
I know that the predict part is a typical Map problem. I'm wondering if I can substitute sfLapply with a parallel Map equivalent, like future_map or parallelMap, or maybe use a more efficient sfLapply call. I'm open to all possibilities.
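For example, something along these lines with furrr is what I have in mind (just a sketch: I'm assuming furrr exports mypred's globals, such as cut and models, to the workers automatically, and I haven't verified that it is actually faster):

library(furrr)
#one worker per land-use layer, as with sfInit above
plan(multisession, workers = nlayers(m))
#parallel Map over the list of land-use masks, replacing sfLapply(l.list, fun = mypred)
pred <- future_map(l.list, mypred, .options = furrr_options(seed = TRUE))
plan(sequential)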
predict is part of the bottleneck, but the problem here is bigger and closely related to the size of my data: the single prediction is not a problem, but the whole cycle is very long. I know that predict can be sped up with clusterR, for example, but is it advisable to use beginCluster within sfLapply, or in general within a parallel version of the apply functions? I just wanted to know if there are better ways to mask and predict large raster data in parallel. As soon as I have time I'll open an issue on GitHub for my problem with terra. – Elia

What I'm asking is how to change mypred
in the example to speed up the loop, or how to change the loop itself to go faster. Since my function masks and applies a predictive model, it seemed appropriate to specify it in the title. However, the example in the question should be reproducible. – Elia
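To be concrete about the clusterR point above: for a single prediction, the pattern I have in mind is something like this sketch (masked and models as in mypred); my doubt is whether it is safe to nest this inside sfLapply:

#sketch: parallelizing one raster::predict call with clusterR
beginCluster(10)
predicted <- clusterR(masked, predict, args = list(model = models[[type]]))
endCluster()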