I get the same error message when running this code:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val txt = Array("A B B C", "A B D D", "A C D")
val txtDf = spark.sparkContext.parallelize(txt).toDF("txt")
val txtDfSplit = txtDf.withColumn("txt", split(col("txt"), " "))

// create a sparse vector with the number of
// occurrences of each word using CountVectorizer
val cvModel = new CountVectorizer()
  .setInputCol("txt")
  .setOutputCol("features")
  .setVocabSize(4)
  .setMinDF(2)
  .fit(txtDfSplit)

val txtDfTrain = cvModel.transform(txtDfSplit)
txtDfTrain.show(false)
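For context, I import LDA because the features column is eventually meant to feed a topic model; this is a rough sketch of that next step (the k and maxIter values are placeholders), and it is not part of the code that fails:

val lda = new LDA()
  .setFeaturesCol("features")
  .setK(2)        // placeholder topic count
  .setMaxIter(10) // placeholder iteration count
val ldaModel = lda.fit(txtDfTrain)
ldaModel.describeTopics(3).show(false)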
The CountVectorizer step above produces this error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3
in stage 1.0 (TID 25, somehostname.domain, executor 1):
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of
org.apache.spark.rdd.MapPartitionsRDD
I have been looking through various pages that describe this error, and it seems to be some kind of version conflict. The code works when run standalone from IntelliJ, but I get the error when the application is submitted to the Spark cluster.
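In case it helps, this is the kind of sbt setup I mean when I say version conflict; the versions below are placeholders rather than my actual build, but my understanding is that the Scala version and the Spark artifact versions have to match what the cluster runs, with the Spark dependencies marked provided:

// build.sbt (placeholder versions, for illustration only)
// scalaVersion must match the Scala build of Spark on the cluster,
// and the Spark artifact versions must match the cluster's Spark version
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided"
)

From what I have read, building against a different Scala binary version than the cluster's Spark (for example, 2.12 locally against a 2.11 cluster) is a common cause of this particular ClassCastException, but I have not been able to confirm that this is what is happening here.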