I'm just testing out h2o since I've heard great things about it. So far I've been using the following code:

library(h2o)

h2o.init(nthreads = -1, max_mem_size = "22G")  # Start (or connect to) an H2O cluster with all threads available
h2o.removeAll() # Clean up. Just in case H2O was already running

train <- read.csv("TRAIN")
test <- read.csv("TEST")

target <- as.factor(train$target)

feature_names <- names(train)[1:(ncol(train)-1)]

train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)

prob <- test[, "id", drop = FALSE]

model_glm <- h2o.glm(x = feature_names,  y = "target", training_frame = train_h2o)
h2o.performance(model_glm) 

pred_glm <- predict(model_glm, newdata = test_h2o)

The relevant part is really that last line, where I got the following error:

DistributedException from localhost/127.0.0.1:54321, caused by java.lang.ArrayIndexOutOfBoundsException

DistributedException from localhost/127.0.0.1:54321, caused by java.lang.ArrayIndexOutOfBoundsException
    at water.MRTask.getResult(MRTask.java:478)
    at water.MRTask.getResult(MRTask.java:486)
    at water.MRTask.doAll(MRTask.java:390)
    at water.MRTask.doAll(MRTask.java:396)
    at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1198)
    at hex.Model.score(Model.java:1030)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:345)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1241)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException

Has anyone come across this before? Are there any easy solutions to this that I might be missing? Thanks in advance.

Do you have the exact same column names in your Train and Test datasets? If not, you will get an error. - PhilC
@PhilC Yep, that was definitely it, small typo in a column name. Should have checked that more carefully before posting. - 114
Hey @PhilC do you mind writing your response as an Answer so that this question can be closed? If not, I'm happy to, but I thought you may prefer to write it yourself. - Erin LeDell
FYI, I just opened a JIRA ticket to create a more informative error message: 0xdata.atlassian.net/browse/PUBDEV-4418 - Erin LeDell

1 Answer

As noted in comments, the column names in the Train and Test datasets need to match exactly or you will get an error message. Glad that you were able to find the issue.
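
For anyone hitting the same thing, here is a rough sketch of a check you could run before calling predict. It uses only base R and made-up toy data; the column names feat_a, feat_b and feat_B are hypothetical, and it assumes the response column is called "target" as in the question.

```r
# Toy frames standing in for the real train/test data (hypothetical values).
train <- data.frame(id = 1:3,
                    feat_a = c(0.1, 0.2, 0.3),
                    feat_b = c(1, 0, 1),
                    target = c(0, 1, 0))
test <- data.frame(id = 4:5,
                   feat_a = c(0.4, 0.5),
                   feat_B = c(0, 1))  # deliberate typo: feat_B instead of feat_b

# Predictor columns: everything in train except the response.
feature_names <- setdiff(names(train), "target")

# Predictors the model was trained on that test does not have.
missing_in_test <- setdiff(feature_names, names(test))

# Columns in test that train never had (often the misspelled twin).
extra_in_test <- setdiff(names(test), names(train))

print(missing_in_test)
print(extra_in_test)

# Report the mismatch before any scoring is attempted.
if (length(missing_in_test) > 0) {
  message("Columns missing from test: ",
          paste(missing_in_test, collapse = ", "))
}
```

Note that setdiff compares names exactly, including case, so a one-character typo shows up in both lists. If missing_in_test is empty, the predictor names line up and the frames should be safe to pass to as.h2o and predict.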