8
votes

Following my answered question: R or Python - loop the test data - Prediction validation next 24 hours (96 values each day)

I want to predict the next day using the H2O package. You can find a detailed explanation of my dataset in the question referenced above.

The data structure in H2O is different.

So, after making the prediction, I want to calculate the MAPE.

I have to convert the training and test data to H2O format:

train_h2o <- as.h2o(train_data)

test_h2o <- as.h2o(test_data)

mape_calc <- function(sub_df) {
  pred <- predict.glm(glm_model, sub_df)
  actual <- sub_df$Ptot
  mape <- 100 * mean(abs((actual - pred)/actual))

  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)

  return(new_df)
}

# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data, test_data$date, mape_calc)

# FINAL DATAFRAME
final_df <- do.call(rbind, df_list)

The code above works well for non-H2O day-ahead prediction validation, and it calculates the MAPE for every day.

I tried to convert the H2O model to a normal R model, but according to https://stackoverflow.com/a/39221269/9341589 that is not possible.

To make a prediction in H2O:

For instance, let's say we want to create a Random Forest model:

y <- "RealPtot" #target
x <- names(train_h2o) %>% setdiff(y) #features


rforest.model <- h2o.randomForest(y=y, x=x, training_frame = train_h2o, ntrees = 2000, mtries = 3, max_depth = 4, seed = 1122)

Then we can get the prediction for the complete dataset as shown below.

predict.rforest <- as.data.frame(h2o.predict(rforest.model, test_h2o))

But in my case I am trying to get a one-day prediction using mape_calc.


NOTE: Any thoughts in R or Python will be appreciated.

UPDATE 2 (reproducible example): Following @Darren Cook's steps:

I have provided a simpler example using the Boston housing dataset.

library(tidyverse)
library(h2o)
h2o.init(ip="localhost",port=54322,max_mem_size = "128g")


data(Boston, package = "MASS")

names(Boston)
[1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"     "rad"     "tax"     "ptratio"
[12] "black"   "lstat"   "medv"   


set.seed(4984)
#Added 15 minute Time and date interval 
Boston$date<- seq(as.POSIXct("01-09-2017 03:00", format = "%d-%m-%Y %H:%M",tz=""), by = "15 min", length = 506)

#select first 333 values to be trained and the rest to be test data
train = Boston[1:333,]
test = Boston[334:506,]

#Dropped the date and time
train_data_finialized  <- subset(train, select=-c(date))

test_data_finialized <- test

#Converted the dataset to h2o object.
train_h2o<- as.h2o(train_data_finialized)
#test_h2o<- as.h2o(test)

#Select the target and feature variables for h2o model
y <- "medv" #target
x <- names(train_data_finialized) %>% setdiff(y) #feature variables

# Number of CV folds (to generate level-one data for stacking)
nfolds <- 5

# Replaced the RF model with GBM because GBM runs faster
# Train & Cross-validate a GBM
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train_h2o,
                  nfolds = nfolds,
                  fold_assignment = "Modulo",
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)

mape_calc <- function(sub_df) {
  p <- h2o.predict(my_gbm, as.h2o(sub_df))
  pred <- as.vector(p)
  actual <- sub_df$medv
  mape <- 100 * mean(abs((actual - pred)/actual))
  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)
  return(new_df)
}


# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data_finialized, test_data_finialized$date, mape_calc)

final_df <- do.call(rbind, df_list)

This is the error I am getting now:

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :

ERROR MESSAGE:

Provided column type POSIXct is unknown. Cannot proceed with parse due to invalid argument.

3
Please post a reproducible example: stackoverflow.com/help/mcve No one can run this code without changing it, since you didn't use a public dataset. - Erin LeDell

3 Answers

10
votes

H2O is running in a separate process to R (whether H2O is on the local server or in a distant data centre). The H2O data and the H2O models are kept in that H2O process, and cannot be seen by R.

What dH <- as.h2o(dR) does is copy an R data frame, dR, into H2O's memory space. The dH is then an R variable that describes the H2O data frame. I.e. it is a pointer, or a handle; it is not the data itself.

What dR <- as.data.frame(dH) does is copy the data from the H2O process's memory, into the R process's memory. (as.vector(dH) does the same for when dH describes a single column)

So, the simplest way to modify your mape_calc(), assuming that sub_df is an R data frame, is to change the first two lines as follows:

mape_calc <- function(sub_df) {
  p <- h2o.predict(rforest.model, as.h2o(sub_df))
  pred <- as.vector(p)

  actual <- sub_df$Ptot
  mape <- 100 * mean(abs((actual - pred)/actual))

  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)

  return(new_df)
}

I.e. upload sub_df to H2O, and give that to h2o.predict(). Then use as.vector() to download the prediction that was made.

This was relative to your original code. So keep the original version of this:

# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data, test_data$date, mape_calc)

I.e. don't use by() directly on test_h2o.


UPDATE based on edited question:

I made two changes to your example code. First, I removed the date column from sub_df. That was what was causing the error message.

The second change was just to simplify the return type; not important, but you ended up with the date column duplicated, before.

mape_calc <- function(sub_df) {
  sub_df_minus_date <- subset(sub_df, select=-c(date))
  p <- h2o.predict(my_gbm, as.h2o(sub_df_minus_date))
  pred <- as.vector(p)
  actual <- sub_df$medv
  mape <- 100 * mean(abs((actual - pred)/actual))
  data.frame(mape = mape)
}
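
A more general variant of the same fix, as a sketch only: instead of naming the date column, drop every date-time column before handing the data to as.h2o(). The helper name drop_datetime_cols and the small data frame are made up for illustration; they are not part of h2o or of your code.

```r
# Hypothetical helper (not part of h2o): remove every date-time column
# (POSIXct or POSIXlt) from a data frame before uploading it to H2O.
drop_datetime_cols <- function(df) {
  is_datetime <- vapply(df, function(col) inherits(col, "POSIXt"), logical(1))
  df[, !is_datetime, drop = FALSE]
}

# Made-up data that mimics the shape of the Boston example
example_df <- data.frame(
  medv = c(24.0, 21.6, 34.7),
  rm   = c(6.575, 6.421, 7.185)
)
example_df$date <- seq(as.POSIXct("2017-09-01 03:00", tz = "UTC"),
                       by = "15 min", length.out = 3)

cleaned <- drop_datetime_cols(example_df)
names(cleaned)  # the date column is gone; medv and rm remain
```

Inside mape_calc() you would then pass as.h2o(drop_datetime_cols(sub_df)) to h2o.predict(), while still reading the actual values from the untouched sub_df.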

ASIDE: h2o.predict() is most efficient when working on a batch of data to make predictions on. Putting h2o.predict() inside a loop is a code smell. You would be better to call h2o.predict(rforest.model, test_h2o) once, outside the loop, then download the predictions into R, and cbind them to test_data, and then use by on that combined data.

UPDATE Here is your example changed to work that way: (I've added the prediction as an extra column to the test data; there are other ways to do it, of course)

 test_h2o <- as.h2o(subset(test_data_finialized, select=-c(date)))
 p <- h2o.predict(my_gbm, test_h2o)
 test_data_finialized$pred = as.vector(p)

 mape_calc2 <- function(sub_df) {
   actual <- sub_df$medv
   mape <- 100 * mean(abs((actual - sub_df$pred)/actual))
   data.frame(mape = mape)
 }

 df_list <- by(test_data_finialized, test_data_finialized$date, mape_calc2)

You should notice that it runs much quicker.

ADDITIONAL UPDATE: by() works by grouping same values of your 2nd argument, and processing them together. As all your timestamps are different, you are processing one row at a time.

Look into the xts library, and e.g. apply.daily() to group timestamps. But for the simple case of wanting to process by date, there is a simple hack. Change your by() line to:

df_list <- by(test_data_finialized, as.Date(test_data_finialized$date), mape_calc2)

Using as.Date() will strip off the times. Therefore all the rows on the same day now look the same and get processed together.
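
To see the difference in grouping without needing H2O at all, here is a small base-R sketch. The data, the column names actual and pred, and the function mape_of are made up for illustration. I pass tz explicitly to as.Date(), because which calendar day a timestamp falls on depends on the time zone used for the conversion.

```r
# Made-up data: eight 15-minute readings that span midnight (UTC)
readings <- data.frame(
  date   = seq(as.POSIXct("2017-09-01 23:00", tz = "UTC"),
               by = "15 min", length.out = 8),
  actual = c(10, 20, 40, 50, 10, 20, 40, 50),
  pred   = c(11, 18, 40, 55, 10, 20, 40, 50)
)

# MAPE of one group of rows
mape_of <- function(sub_df) {
  100 * mean(abs((sub_df$actual - sub_df$pred) / sub_df$actual))
}

# Grouping on the raw timestamps: every row is its own group
per_timestamp <- by(readings, readings$date, mape_of)

# Grouping on the calendar date: rows from the same day are processed together
per_day <- by(readings, as.Date(readings$date, tz = "UTC"), mape_of)

length(per_timestamp)  # 8 groups, one per row
length(per_day)        # 2 groups, one per day
```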

ASIDE 2: You would get better responses if you make the infamous minimal example. Then people can run your code, and they can test their answers. It is also often better to use a simple data set everyone has, e.g. iris, rather than your own data. (You can do regression on any of the first 4 fields; using iris does not have to always be about predicting the species.)

ASIDE 3: You can do MAPE completely inside H2O, as the abs() and mean() functions will work directly on H2O data frames (as do lots of other things - see the H2O manual): https://stackoverflow.com/a/43103229/841830 (I'm not marking this as a duplicate, as your question was how to adapt by() for use with H2O data frames, not how to calculate MAPE efficiently!)
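
For reference, here is the arithmetic behind the MAPE expression used in these snippets, worked through on plain R vectors with made-up numbers. One thing to keep in mind: the formula divides by the actual value, so a zero in the actuals makes the result infinite.

```r
# Made-up actual values and predictions
actual <- c(100, 200, 400)
pred   <- c(110, 180, 400)

# Absolute percentage error per row: 10/100, 20/200, 0/400
ape <- abs((actual - pred) / actual)

# Mean of the three errors, expressed as a percentage (about 6.67)
mape <- 100 * mean(ape)

# A zero actual value breaks the formula: the division yields Inf
mape_with_zero <- 100 * mean(abs((c(0, 200) - c(1, 180)) / c(0, 200)))
is.infinite(mape_with_zero)
```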

5
votes

It looks like you are mixing up R and H2O data types. Remember H2O's R is simply an R API and is not the same as native R. This means that you can't apply an R function that expects an R dataframe to an H2OFrame. And likewise you can't apply an H2O Function to an R dataframe when it expects an H2OFrame.

As you can see from the R docs on by(), it is a function that expects "an R object, normally a data frame, possibly a matrix", so you can't pass in an H2O frame.

Similarly you are passing date = H2OFrame to data.frame().

However you can use the as.data.frame() to convert an H2OFrame to an R dataframe and then go about your calculations entirely in R.

0
votes

Could it simply be the file format that is the problem? I got "Provided column type POSIXct is unknown" after I imported from Excel and ran:

hr_data_h2o <- as.h2o(hr_data)
split_h2o <- h2o.splitFrame(hr_data_h2o, c(0.7, 0.15), seed = 1234)

I changed the source file to tab delimited (no other changes) and the problem went away.