0
votes

I have a 2043-hourly time series. My aim is to train and test a forecasting model through Gaussian Processes Regression (GPR) model, including its uncertainty estimation. I've got good results on punctual predictions applying a joint of R packages caret + kernlab::train( ) model. How could I estimate its prediction interval using these R packages?

My .csv data are available via three Dropbox links:

Training data

Testing data

Complete Time series data.frame (Time, train and test data)

The aim is to predict "ST4_lead1" using the "ST3" and "ST4_lag1" regressors. They're our columns on the data.frames.

From now on, I'll present my code. This first window code is common to both of us. You may fully run it.

# Forecasting dataseries: caret + kernlab packages
################ Load R packages
library(readr)
library(tidyverse)
library(caret)
library(kernlab)
library(hydroGOF)
library(lubridate)


################ Load data
setwd("C:/...") # Set the folder you saved the downloaded .csv data

data_train_emd <- read.csv("data_train_emd.csv", sep = ';')
data_test_emd <- read.csv("data_test_emd.csv", sep = ';')
final_df_emd <- read.csv("final_df_emd.csv", sep = ';')


################ GaussprRadial model Preprocess
set.seed(111)
cvCtrl <- trainControl(
  method = "repeatedcv",    # Cross-correlation preprocess
  repeats = 1,
  number = 3,
  allowParallel = TRUE,
  verboseIter = TRUE,
  savePredictions = "final")


################ Set gaussprRadial grid for tuning parameters
grid_gaussprRadial <- expand.grid(sigma = c(1*1e-3, 5*1e-3))


################ Implement Gaussian Process
################ Train
set.seed(111)
attach(data_train_emd)
system.time(Fitting_model <- train(ST4_lead1 ~ ST3 + ST4_lag1,      # Predict "ST4_lead1" based on regressors "ST3" and "ST4_lag1"
                              method ="gaussprRadial",              # Radial Basis kernel function
                              trControl = cvCtrl,
                              data = data_train_emd,
                              metric = "MAE",                       # Using "MAE" as fitting parameters, since I'm focusing on low data
                              preProcess = c("center", "scale"),    # Centering and scaling data Preprocess
                              tuneGrid = grid_gaussprRadial,
                              maxit = 1000,
                              linout = 1))                          # 1 for regression model

#The training process lasts 14 seconds on my machine.



################ Test
attach(final_df_emd)
forecasted <- data_test_emd %>%                                           
  mutate(Qpred = predict(Fitting_model, newdata = data_test_emd),              
         Time = final_df_emd %>% filter(Type == "Test") %>% pull(Time))   


################ Tidy training results
################ View metrics for training process
results_gaussprRadial <- Fitting_model$resample %>% mutate(Model = "")


colnames(results_gaussprRadial) = c("RMSE", "R2", "MAE", "Sigma", "Model")
results_gaussprRadial %>% 
  select(RMSE, R2, MAE, Model) %>% 
  gather(a, b, -Model) %>% 
  ggplot(aes(Model, b, color = Model, fill = Model)) + 
  geom_boxplot(alpha = 0.3, show.legend = FALSE) + 
  facet_wrap(~ a, scales = "free") + 
  labs(x = "Train simulation performances", y = NULL)


QObs_Tr = Fitting_model$pred$obs # Observed data to be trained
Qsim = Fitting_model$pred$pred   # Simulated data in the training process
Residual_Tr = QObs_Tr - Qsim 
train_obs_calc <- data.frame(QObs_Tr, Qsim, Residual_Tr)

ggplot(data = train_obs_calc, aes(x = QObs_Tr, y = Qsim)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  labs(title = "Error Scatterplot", x = "Q(m³/s)", y = "Q(m³/s)")

We have done the common training and testing processes as well as obtaining the testing results.

In the second window code we'll extract some testing results from forecasted object. By the way, I expect that forecasted will bring us the desired prediction intervals, which I need assistance.

################ Tidy testing results
################ Set time period
N.hours <- as.numeric(ymd_h("2016-10-23 10") - ymd_h("2016-10-02 05"))*24 # Set date limits
Time = ymd_h("2016-10-02 05") + hours(0:N.hours)


################ View testing process results
QObs_Test = forecasted$ST4_lead1 # Observed data to be tested
QPred = forecasted$Qpred   # Simulated data in the testing process
Residual_Test = QObs_Test - QPred 
test_obs_calc <- data.frame(Time, QObs_Test, QPred, Residual_Test)

ggplot(data = test_obs_calc, aes(x = Time, y = QObs_Test)) +
  geom_line() +
  geom_point(aes(y = QPred), color = "coral") +
  labs(title = "Testing Results", x = "", y = "Q(m³/s)")

It brings us the testing results plot, as below. The continuous line is the observed data, while the red dots are the predicted estimates.

enter image description here

Lastly, in this third window code, we just compute the metrics for the regression:

################ Metrics performance
################ Train
gof(QObs_Tr, Qsim)

################ Test
gof(QObs_Test, QPred)

As we see, for training process, R² = 0.76 and MAE = 1.06; whereas for testing counterpart, R² = 0.80 and MAE = 1.36. These are good enough for us.

In GPR, the prediction invervals are based on a gaussian distribution around each prediction point, as provided by Rasmussen and Williams (2006). I had tried the GPR through three other ways: using just kernlab::train( ), tgp::bgp( ) and gplmr::train( ). This last one was graciously suggested by duckmayr. For all of them I found the prediction intervals. However, the joint use of "caret + kernlab::train( )" (as showed in my code) returned much better punctual predictions. I don't know the reason why.

As I mentioned between the 1st and the 2nd window code, I may expect that from "forecasted" object, we may extract the average data from each prediction point and add (or subtract) 2 times its standard deviation, allowing do y = average +- 2*sd. However, I don't know how to extract this average, or even the standard deviation on each predicted point.

In short, how do I get the prediction interval for Gaussian Processes Regression using the joint use of caret + kernlab packages?

1

1 Answers

0
votes

Have you tried typing "forecasted$" and look if these averages and sd appear?