1
votes

I'm trying to import GloVe into an H2O cluster via R using the word2vec function. Following the question "Does or will H2O provide any pretrained vectors for use with h2o word2vec?", I downloaded the pretrained glove.840B.300d.txt file and tried to import it into H2O, but parsing failed. I then read the GloVe file into R, removed one line that was recognized as NA, and saved the result as a CSV. Parsing the CSV file in H2O went well, but I couldn't create a word2vec model from it because it threw a java.lang.NullPointerException.

I'm running version h2o_3.15.0.99999.

My code:

library(h2o)
h2o.init()
glove <- h2o.importFile("glove.840B.300d.csv", header = FALSE)
model <- h2o.word2vec(pre_trained = glove, vec_size = 300)

Full output:

|==========================================================================| 100%

java.lang.NullPointerException
java.lang.NullPointerException
at water.AutoBuffer.tcpOpen(AutoBuffer.java:488)
at water.AutoBuffer.sendPartial(AutoBuffer.java:679)
at water.AutoBuffer.putA4f(AutoBuffer.java:1383)
at hex.word2vec.Word2VecModel$Word2VecOutput$Icer.write90(Word2VecModel$Word2VecOutput$Icer.java)
at hex.word2vec.Word2VecModel$Word2VecOutput$Icer.write(Word2VecModel$Word2VecOutput$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:771)
at hex.Model$Icer.write86(Model$Icer.java)
at hex.word2vec.Word2VecModel$Icer.write85(Word2VecModel$Icer.java)
at hex.word2vec.Word2VecModel$Icer.write(Word2VecModel$Icer.java)
at water.Iced.write(Iced.java:61)
at water.Iced.asBytes(Iced.java:42)
at water.Value.<init>(Value.java:348)
at water.TAtomic.atomic(TAtomic.java:22)
at water.Atomic.compute2(Atomic.java:56)
at water.Atomic.fork(Atomic.java:39)
at water.Atomic.invoke(Atomic.java:31)
at water.Lockable.unlock(Lockable.java:181)
at water.Lockable.unlock(Lockable.java:176)
at hex.word2vec.Word2Vec$Word2VecDriver.computeImpl(Word2Vec.java:72)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:205)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

2 Answers

1
votes

Thanks for the report. The current implementation is limited by the JVM's maximum array length. This model seems to be too large and exceeds that limit.
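As a rough back-of-the-envelope check, here is a minimal sketch in plain R. It assumes the vectors end up serialized into a single byte array (which the asBytes and putA4f frames in the trace suggest, but which is not confirmed here) and that the file holds roughly 2.2 million words; both figures are assumptions for illustration, not measured values.

```r
# Approximate vocabulary size of glove.840B.300d (assumption, not measured)
vocab_size <- 2.2e6
vec_size <- 300
bytes_per_float <- 4

# A Java array can hold at most 2^31 - 1 elements, the same value as R's integer max
max_array_len <- .Machine$integer.max

# Size of all vectors if written as 4-byte floats into one byte array
total_bytes <- vocab_size * vec_size * bytes_per_float
exceeds <- total_bytes > max_array_len

# Largest number of 300-dimensional vectors that would fit under that limit
max_words <- floor(max_array_len / (vec_size * bytes_per_float))

print(exceeds)
print(max_words)
```

Under these assumptions the full model would not fit in one array, while a vocabulary trimmed to well under two million words would.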

We will have to fix it in H2O.

1
votes

Given that you are exceeding the maximum array size for a model, you could trim the vocabulary back a bit as a workaround. I'm assuming the vocabulary in the GloVe file is ordered by frequency. In other words, I am assuming the most frequent words come first and the ones at the end are generally obscure and less useful.

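For example, this code uses only the first 50% of the words:

library(h2o)
h2o.init()
glove <- h2o.importFile("glove.840B.300d.csv", header = FALSE)
# Slice by row index to preserve file order (h2o.splitFrame() would split at random)
half <- floor(nrow(glove) / 2)
parts <- list(glove[1:half, ], glove[(half + 1):nrow(glove), ])
modelCommon <- h2o.word2vec(pre_trained = parts[[1]], vec_size = 300)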

Depending on what you were going to do next, you could make a second model for the second half of the data:

modelObscure <- h2o.word2vec(pre_trained = parts[[2]], vec_size = 300)