1 vote

I have the following three classes, and I am getting a

Task not serializable

error. The full stack trace is below.

The first class is a serializable Person:

public class Person implements Serializable {
    private String name;
    private int age;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }
}

This class reads from a text file and maps each line to the Person class:

public class sqlTestv2 implements Serializable {

    private int appID;
    private int dataID;
    private JavaSparkContext sc;

    public sqlTestv2(int appID, int dataID, JavaSparkContext sc) {
        this.appID = appID;
        this.dataID = dataID;
        this.sc = sc;
    }

    public JavaRDD<Person> getDataRDD() {
        JavaRDD<String> test = sc.textFile("hdfs://localhost:8020/user/cloudera/people.txt");

        JavaRDD<Person> people = test.map(
            new Function<String, Person>() {
                public Person call(String line) throws Exception {
                    String[] parts = line.split(",");

                    Person person = new Person();
                    person.setName(parts[0]);
                    person.setAge(Integer.parseInt(parts[1].trim()));

                    return person;
                }
            });

        return people;
    }
}

And this class retrieves the RDD and performs operations on it:

public class sqlTestv1 implements Serializable {

    public static void main(String[] arg) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
        sqlTestv2 v2 = new sqlTestv2(1, 1, sc);
        JavaRDD<Person> test = v2.getDataRDD();

        DataFrame schemaPeople = sqlContext.createDataFrame(test, Person.class);
        schemaPeople.registerTempTable("people");

        DataFrame df = sqlContext.sql("SELECT age FROM people");
        df.show();
    }
}

Full error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.map(RDD.scala:293)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:90)
    at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:47)
    at com.oreilly.learningsparkexamples.mini.java.sqlTestv2.getDataRDD(sqlTestv2.java:54)
    at com.oreilly.learningsparkexamples.mini.java.sqlTestv1.main(sqlTestv1.java:41)
Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.api.java.JavaSparkContext, value: org.apache.spark.api.java.JavaSparkContext@3c3b144b)
    - field (class: com.oreilly.learningsparkexamples.mini.java.sqlTestv2, name: sc, type: class org.apache.spark.api.java.JavaSparkContext)
    - object (class com.oreilly.learningsparkexamples.mini.java.sqlTestv2, com.oreilly.learningsparkexamples.mini.java.sqlTestv2@3752fdda)
    - field (class: com.oreilly.learningsparkexamples.mini.java.sqlTestv2$1, name: this$0, type: class com.oreilly.learningsparkexamples.mini.java.sqlTestv2)
    - object (class com.oreilly.learningsparkexamples.mini.java.sqlTestv2$1, com.oreilly.learningsparkexamples.mini.java.sqlTestv2$1@70c171ec)
    - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
    - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)


2 Answers

3 votes

The stack trace tells you the answer: it's the JavaSparkContext field that you're storing in sqlTestv2. The anonymous Function keeps a hidden reference (the `this$0` shown in the serialization stack) to its enclosing sqlTestv2 instance, so Spark tries to serialize sc along with the closure. You should pass sc into the method, not store it in the class.
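The mechanism can be shown without Spark at all, using plain Java serialization. In this sketch (all names hypothetical), FakeContext stands in for JavaSparkContext and SerializableTask stands in for Spark's Function interface, which extends Serializable:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    // Stand-in for JavaSparkContext: any class that is NOT Serializable.
    static class FakeContext { }

    // Stand-in for Spark's Function interface, which extends Serializable.
    interface SerializableTask extends Serializable {
        int run();
    }

    // Mirrors sqlTestv2: the context is a field, so the anonymous task's
    // hidden this$0 reference drags it into the serialized closure.
    static class HoldsAsField implements Serializable {
        private FakeContext ctx = new FakeContext();
        private int id = 1;

        SerializableTask makeTask() {
            return new SerializableTask() {
                public int run() { return id; } // references the enclosing instance
            };
        }
    }

    // The fix: take the context as a method parameter. The task still captures
    // this$0, but the enclosing instance now holds nothing unserializable.
    static class PassesToMethod implements Serializable {
        private int id = 1;

        SerializableTask makeTask(FakeContext ctx) {
            // ctx is used only here, "driver side"; the task never touches it,
            // so it is not captured
            return new SerializableTask() {
                public int run() { return id; }
            };
        }
    }

    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new HoldsAsField().makeTask()));                    // false
        System.out.println(serializes(new PassesToMethod().makeTask(new FakeContext()))); // true
    }
}
```

Applied to the question's code, that means giving getDataRDD a `JavaSparkContext sc` parameter and deleting the sc field from sqlTestv2.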

1 vote

You can add the transient modifier to sc so that it is not serialized. Note that a transient field comes back null after deserialization, so this only works because the function never actually uses sc on the executors.
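A minimal, Spark-free sketch of what transient does (class names hypothetical; FakeContext stands in for JavaSparkContext):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo {
    // Stand-in for JavaSparkContext: not Serializable.
    static class FakeContext { }

    static class Holder implements Serializable {
        transient FakeContext ctx = new FakeContext(); // skipped by serialization
        int id = 7;                                    // serialized normally
    }

    static Holder roundTrip(Holder h) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(h); // succeeds: the transient field is never touched
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            return (Holder) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Holder copy = roundTrip(new Holder());
        System.out.println(copy.ctx); // null: the field did not cross the wire
        System.out.println(copy.id);  // 7
    }
}
```

This is why `private transient JavaSparkContext sc;` silences the NotSerializableException: serializing the sqlTestv2 instance simply skips sc, and the deserialized copy on the workers holds null there, which is harmless as long as the closure never calls into it.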