7
votes

I'd like to pass an object from the driver node to other nodes where an RDD resides, so that each partition of the RDD can access that object, as shown in the following snippet.

object HelloSpark {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
                .setAppName("Testing HelloSpark")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrator", "xt.HelloKryoRegistrator")

        val sc = new SparkContext(conf)
        val rdd = sc.parallelize(1 to 20, 4)
        val bytes = new ImmutableBytesWritable(Bytes.toBytes("This is a test"))

        rdd.map(x => x.toString + "-" + Bytes.toString(bytes.get) + " !")
            .collect()
            .foreach(println)

        sc.stop
    }
}

// My registrator
class HelloKryoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) = {
        kryo.register(classOf[ImmutableBytesWritable], new HelloSerializer())
    }
}

//My serializer 
class HelloSerializer extends Serializer[ImmutableBytesWritable] {
    override def write(kryo: Kryo, output: Output, obj: ImmutableBytesWritable): Unit = {
        output.writeInt(obj.getLength)
        output.writeInt(obj.getOffset)
        output.writeBytes(obj.get(), obj.getOffset, obj.getLength)
    }

    override def read(kryo: Kryo, input: Input, t: Class[ImmutableBytesWritable]): ImmutableBytesWritable = {
        val length = input.readInt()
        val offset = input.readInt()
        val bytes  = new Array[Byte](length)
        input.read(bytes, offset, length)

        new ImmutableBytesWritable(bytes)
    }
}

In the snippet above, I tried to serialize ImmutableBytesWritable by Kryo in Spark, so I did the follwing:

  1. configure the SparkConf instance passed to spark context, i.e., set "spark.serializer" to "org.apache.spark.serializer.KryoSerializer" and set "spark.kryo.registrator" to "xt.HelloKryoRegistrator";
  2. Write a custom Kryo registrator class in which I register the class ImmutableBytesWritable;
  3. Write a serializer for ImmutableBytesWritable

However, when I submit my Spark application in yarn-client mode, the following exception was thrown:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.map(RDD.scala:270) at xt.HelloSpark$.main(HelloSpark.scala:23) at xt.HelloSpark.main(HelloSpark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:325) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: org.apache.hadoop.hbase.io.ImmutableBytesWritable at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) ... 12 more

It seems that ImmutableBytesWritable can't be serialized by Kryo. So what is the correct way to let Spark serialize an object using Kryo? Can Kryo serialize any type?

1
The same is happening to me, even with a much more straightforward configuration (just setting the serializer config and registering classes). Do note this line of your stack: org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73), for some reason, it is trying to use Java serialization even when you told him not to.Daniel Langdon
Did you manage to resolve this? I'm having the same problem.Nilesh

1 Answers

1
votes

This is happening because you're using ImmutableBytesWritable in your closure. Spark doesn't support closure serialization with Kryo yet (only objects in RDDs). You can take the help of this to solve your problem:

Spark - Task not serializable: How to work with complex map closures that call outside classes/objects?

You simply need to serialize the objects before passing through the closure, and de-serialize afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)

Here's an example sketch:

def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
               (foo: Foo) : Bar = {
    kryoWrapper.value.apply(foo)
}
val mapper = genMapper(KryoSerializationWrapper(new ImmutableBytesWritable(Bytes.toBytes("This is a test")))) _
rdd.flatMap(mapper).collectAsMap()

object ImmutableBytesWritable(bytes: Bytes) extends (Foo => Bar) {
    def apply(foo: Foo) : Bar = { //This is the real function }
}