I get the same error message when running this code:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val txt = Array("A B B C", "A B D D", "A C D")
val txtDf = spark.sparkContext.parallelize(txt).toDF("txt")
val txtDfSplit = txtDf.withColumn("txt", split(col("txt"), " "))

// create a sparse vector with the number of
// occurrences of each word using CountVectorizer
val cvModel = new CountVectorizer()
  .setInputCol("txt")
  .setOutputCol("features")
  .setVocabSize(4)
  .setMinDF(2)
  .fit(txtDfSplit)

val txtDfTrain = cvModel.transform(txtDfSplit)
txtDfTrain.show(false)
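For context, I import LDA because the features column is eventually meant to feed a topic model; this is a rough sketch of that next step (the k and maxIter values are placeholders), and it is not part of the code that fails:

val lda = new LDA()
  .setFeaturesCol("features")
  .setK(2)        // placeholder topic count
  .setMaxIter(10) // placeholder iteration count
val ldaModel = lda.fit(txtDfTrain)
ldaModel.describeTopics(3).show(false)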
The CountVectorizer step above produces this error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3
in stage 1.0 (TID 25, somehostname.domain, executor 1):
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of
org.apache.spark.rdd.MapPartitionsRDD
I have been looking through various pages that describe this error, and it seems to be some kind of version conflict. The code works when run standalone from IntelliJ, but I get the error when the application is submitted to the Spark cluster.
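In case it helps, this is the kind of sbt setup I mean when I say version conflict; the versions below are placeholders rather than my actual build, but my understanding is that the Scala version and the Spark artifact versions have to match what the cluster runs, with the Spark dependencies marked provided:

// build.sbt (placeholder versions, for illustration only)
// scalaVersion must match the Scala build of Spark on the cluster,
// and the Spark artifact versions must match the cluster's Spark version
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided"
)

From what I have read, building against a different Scala binary version than the cluster's Spark (for example, 2.12 locally against a 2.11 cluster) is a common cause of this particular ClassCastException, but I have not been able to confirm that this is what is happening here.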