3
votes

When I try to convert from spark dataframe to H2O data frame I get the error below. This seems to have to do with the size of the dataframe because when I make it smaller the converter between spark and H2O works well.

Are there any configurations that need to be changed in order to convert large spark dataframes to H2O using sparkling water? In my configuration I am allowing max memory to the driver and executor so this is not a memory issue.

I am using R here the code is:

training<-as_h2o_frame(sc, final1, strict_version_check = FALSE)

Error:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 95.1 failed 4 times, most recent failure: Lost task 4.3 in stage 95.1 (TID 4050, 10.0.0.9): java.lang.ArrayIndexOutOfBoundsException: 65535
                at water.DKV.get(DKV.java:202)
                at water.DKV.get(DKV.java:175)
                at water.Key.get(Key.java:83)
                at water.fvec.Frame.createNewChunks(Frame.java:896)
                at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
                at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
                at org.apache.spark.h2o.backends.internal.InternalWriteConverterCtx.createChunks(InternalWriteConverterCtx.scala:29)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:95)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
                at org.apache.spark.scheduler.Task.run(Task.scala:86)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
                at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
                at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
                at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
                at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
                at scala.Option.foreach(Option.scala:257)
                at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
                at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
                at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
                at org.apache.spark.h2o.converters.WriteConverterCtxUtils$.convert(WriteConverterCtxUtils.scala:83)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$.toH2OFrame(SparkDataFrameConverter.scala:74)
                at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:145)
                at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:143)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:498)
                at sparklyr.Invoke$.invoke(invoke.scala:102)
                at sparklyr.StreamHandler$.handleMethodCall(stream.scala:89)
                at sparklyr.StreamHandler$.read(stream.scala:54)
                at sparklyr.BackendHandler.channelRead0(handler.scala:49)
                at sparklyr.BackendHandler.channelRead0(handler.scala:14)
                at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
                at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
                at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
                at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
                at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
                at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
                at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
                at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
                at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
                at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
                at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
                at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
                at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
                at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
                at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
                at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
                at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
                at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 65535
                at water.DKV.get(DKV.java:202)
                at water.DKV.get(DKV.java:175)
                at water.Key.get(Key.java:83)
                at water.fvec.Frame.createNewChunks(Frame.java:896)
                at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
                at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
                at org.apache.spark.h2o.backends.internal.InternalWriteConverterCtx.createChunks(InternalWriteConverterCtx.scala:29)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:95)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
                at org.apache.spark.scheduler.Task.run(Task.scala:86)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                ... 1 more
1
It seems like your H2O cloud is not properly initialized. Please check the readme here github.com/h2oai/rsparkling#spark-connectionJakub Háva

1 Answers

0
votes

going to repost Jakub's comment so it is more easily found:

It seems like your H2O cloud is not properly initialized. Please check the readme here github.com/h2oai/rsparkling#spark-connection