Following up on my answered question: R or Python - loop the test data - Prediction validation next 24 hours (96 values each day)
I want to predict the next day using the H2O package. A detailed explanation of my dataset is in the question linked above.
H2O uses its own data format (an H2OFrame), which differs from a regular R data frame.
After making the prediction, I want to calculate the MAPE.
First I have to convert the training and testing data to H2O format:
train_h2o <- as.h2o(train_data)
test_h2o <- as.h2o(test_data)
mape_calc <- function(sub_df) {
  pred <- predict.glm(glm_model, sub_df)
  actual <- sub_df$Ptot
  mape <- 100 * mean(abs((actual - pred)/actual))
  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)
  return(new_df)
}
# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data, test_data$date, mape_calc)
# FINAL DATAFRAME
final_df <- do.call(rbind, df_list)
The code above works well for "non-H2O" day-ahead prediction validation, and it calculates the MAPE for every day.
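For reference, here is a minimal sketch of the MAPE formula used in `mape_calc`, applied to made-up numbers. The helper name `mape` and the toy values are assumptions for illustration only and do not come from my dataset. Note that the formula is undefined whenever an actual value is zero.

```r
# Mean absolute percentage error, same formula as inside mape_calc.
# `mape` is a hypothetical helper name; the numbers below are invented.
mape <- function(actual, pred) {
  100 * mean(abs((actual - pred) / actual))
}

actual <- c(100, 200, 400)
pred   <- c(110, 180, 400)

# Per-point relative errors are 0.10, 0.10 and 0.00,
# so the result is 100 * 0.2 / 3, roughly 6.67 percent.
toy_mape <- mape(actual, pred)
print(round(toy_mape, 2))
```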
I tried to convert the H2O model to a normal R model format, but according to https://stackoverflow.com/a/39221269/9341589 that is not possible.
To make a prediction in H2O:
For instance, let's say we want to create a random forest model:
y <- "RealPtot" #target
x <- names(train_h2o) %>% setdiff(y) #features
rforest.model <- h2o.randomForest(y=y, x=x, training_frame = train_h2o, ntrees = 2000, mtries = 3, max_depth = 4, seed = 1122)
Then we can get the predictions for the complete dataset as shown below.
predict.rforest <- as.data.frame(h2o.predict(rforest.model, test_h2o))
But in my case I am trying to get one-day predictions using mape_calc.
NOTE: Any thoughts in R or Python will be appreciated.
UPDATE 2 (reproducible example): Following @Darren Cook's steps:
I have provided a simpler example using the Boston housing dataset.
library(tidyverse)
library(h2o)
h2o.init(ip="localhost",port=54322,max_mem_size = "128g")
data(Boston, package = "MASS")
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" "tax" "ptratio"
[12] "black" "lstat" "medv"
set.seed(4984)
#Added 15 minute Time and date interval
Boston$date<- seq(as.POSIXct("01-09-2017 03:00", format = "%d-%m-%Y %H:%M",tz=""), by = "15 min", length = 506)
#select first 333 values to be trained and the rest to be test data
train = Boston[1:333,]
test = Boston[334:506,]
#Dropped the date and time
train_data_finialized <- subset(train, select=-c(date))
test_data_finialized <- test
#Converted the dataset to h2o object.
train_h2o<- as.h2o(train_data_finialized)
#test_h2o<- as.h2o(test)
#Select the target and feature variables for h2o model
y <- "medv" #target
x <- names(train_data_finialized) %>% setdiff(y) #feature variables
# Number of CV folds (to generate level-one data for stacking)
nfolds <- 5
#Replaced RF model by GBM because GBM run faster
# Train & Cross-validate a GBM
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train_h2o,
                  nfolds = nfolds,
                  fold_assignment = "Modulo",
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)
mape_calc <- function(sub_df) {
  p <- h2o.predict(my_gbm, as.h2o(sub_df))
  pred <- as.vector(p)
  actual <- sub_df$medv
  mape <- 100 * mean(abs((actual - pred)/actual))
  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)
  return(new_df)
}
# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data_finialized, test_data_finialized$date, mape_calc)
final_df <- do.call(rbind, df_list)
This is the error I am getting now:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Provided column type POSIXct is unknown. Cannot proceed with parse due to invalid argument.
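One direction that might work, given here as an untested sketch under assumptions rather than a confirmed fix: the message suggests that `as.h2o()` rejects the `POSIXct` column, so the date column could be dropped from each subset before conversion. The rows could also be grouped by calendar day (`as.Date`) instead of by the full timestamp, since grouping by a 15-minute timestamp gives one row per group. The helper `daily_mape`, its `predict_fn` argument, and the toy data are hypothetical; `predict_fn` stands in for the H2O call so that the sketch runs in base R without a cluster.

```r
# Hypothetical helper: one MAPE value per calendar day.
# predict_fn(features_df) must return a numeric vector of predictions.
daily_mape <- function(df, predict_fn, target = "medv",
                       date_col = "date", tz = "UTC") {
  # Group by calendar day, not by the full 15-minute timestamp.
  # The time zone is passed explicitly so day boundaries are unambiguous.
  day <- format(as.Date(df[[date_col]], tz = tz))
  pieces <- lapply(split(df, day), function(sub_df) {
    # Drop the POSIXct column before the rows are handed to the model
    features <- sub_df[, setdiff(names(sub_df), date_col), drop = FALSE]
    pred <- predict_fn(features)
    actual <- sub_df[[target]]
    data.frame(date = as.Date(sub_df[[date_col]][1], tz = tz),
               mape = 100 * mean(abs((actual - pred) / actual)))
  })
  do.call(rbind, pieces)
}

# Toy data: two days of invented values and a stand-in "model"
# that always over-predicts by 10 percent.
toy <- data.frame(
  date = as.POSIXct(c("2017-09-01 03:00", "2017-09-01 03:15",
                      "2017-09-02 03:00", "2017-09-02 03:15"), tz = "UTC"),
  medv = c(10, 20, 40, 50)
)
toy_result <- daily_mape(toy, function(d) d$medv * 1.1)
print(toy_result)
```

With H2O, `predict_fn` could be something like `function(d) as.data.frame(h2o.predict(my_gbm, as.h2o(d)))$predict`, assuming the prediction column of a regression model is named `predict`. Converting each day separately means one upload to the cluster per day; predicting once on the whole test frame (without the date column) and then splitting the predictions by day in R would probably be cheaper.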