I am working in Spark and picking up Scala along the way. I have a question about the RDD API and how the various RDD subclasses are implemented. Specifically, I ran the following code in spark-shell:
scala> val gspeech_path="/home/myuser/gettysburg.txt"
gspeech_path: String = /home/myuser/gettysburg.txt
scala> val lines=sc.textFile(gspeech_path)
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:29
scala> val pairs = lines.map(x => (x.split(" ")(0), x))
pairs: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[8] at map at <console>:3
scala> val temps:Seq[(String,Seq[Double])]=Seq(("SP",Seq(68,70,75)),
("TR",Seq(87,83,88,84,88)),
("EN",Seq(52,55,58,57.5)),
("ER",Seq(90,91.3,88,91)))
temps: Seq[(String, Seq[Double])] = List((SP,List(68.0, 70.0, 75.0)), (TR,List(87.0, 83.0, 88.0, 84.0, 88.0)), (EN,List(52.0, 55.0, 58.0, 57.5)), (ER,List(90.0, 91.3, 88.0, 91.0)))
scala> var temps_rdd0=sc.parallelize(temps)
temps_rdd0: org.apache.spark.rdd.RDD[(String, Seq[Double])] = ParallelCollectionRDD[9] at parallelize at <console>:29
I wanted to investigate a bit more, so I looked up the API for MapPartitionsRDD and ParallelCollectionRDD, expecting them to be subclasses of the base class org.apache.spark.rdd.RDD. However, I couldn't find these classes when I searched the Spark Scala API (Scaladocs); I was able to find them only in the Java docs at spark.apache.org, not the Scala docs. From what I know of Scala, the two languages can intermingle, and I had assumed Spark itself was written in Java. I would appreciate some clarification on the exact relationship as it pertains to RDDs. Is it the case, as per this response, that we have an abstract Scala RDD reference whose underlying implementation is a concrete Java RDD, i.e.:

# Scala abstract RDD = concrete Java MapPartitionsRDD
org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7]
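For reference, the concrete classes behind these abstract references can be checked directly in spark-shell (a minimal sketch reusing the lines and temps_rdd0 values defined above; the results match the class names already shown in the transcript output):

// The static type is the abstract RDD[T]; getClass reveals the concrete subclass
lines.getClass.getName        // org.apache.spark.rdd.MapPartitionsRDD
temps_rdd0.getClass.getName   // org.apache.spark.rdd.ParallelCollectionRDD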
Thanks in advance for your help/explanation.
MapPartitionsRDD is a subclass of RDD: github.com/apache/spark/blob/master/core/src/main/scala/org/… - Archeg
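Following up on that pointer: the subclass relationship can be confirmed from spark-shell by walking the superclass chain of the concrete object (a small sketch reusing the lines RDD from the question; the expected output is noted in comments):

// Print the inheritance chain of the concrete RDD behind `lines`
var cls: Class[_] = lines.getClass
while (cls != null) {
  println(cls.getName)
  cls = cls.getSuperclass
}
// org.apache.spark.rdd.MapPartitionsRDD
// org.apache.spark.rdd.RDD
// java.lang.Object

So MapPartitionsRDD is a Scala class defined in Spark's own source (the linked file), not a separate Java implementation; as far as I can tell it is declared private[spark], which would explain why it shows up in the generated Java docs but not in the public Scaladocs.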