I am working in Spark and picking up Scala along the way. I have a question about the RDD API and how the various RDD subclasses are implemented. Specifically, I ran the following code in spark-shell:
scala> val gspeech_path="/home/myuser/gettysburg.txt"
gspeech_path: String = /home/myuser/gettysburg.txt
scala> val lines=sc.textFile(gspeech_path)
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:29
scala> val pairs = lines.map(x => (x.split(" ")(0), x))
pairs: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[8] at map at <console>:3
scala> val temps:Seq[(String,Seq[Double])]=Seq(("SP",Seq(68,70,75)),
("TR",Seq(87,83,88,84,88)),
("EN",Seq(52,55,58,57.5)),
("ER",Seq(90,91.3,88,91)))
temps: Seq[(String, Seq[Double])] = List((SP,List(68.0, 70.0, 75.0)), (TR,List(87.0, 83.0, 88.0, 84.0, 88.0)), (EN,List(52.0, 55.0, 58.0, 57.5)), (ER,List(90.0, 91.3, 88.0, 91.0)))
scala> var temps_rdd0=sc.parallelize(temps)
temps_rdd0: org.apache.spark.rdd.RDD[(String, Seq[Double])] = ParallelCollectionRDD[9] at parallelize at <console>:29
I wanted to investigate a bit more, so I looked up the API for MapPartitionsRDD and ParallelCollectionRDD, expecting them to be subclasses of the base class org.apache.spark.rdd.RDD. However, I couldn't find these classes when I searched the Spark Scala API (Scaladocs); I was able to find them only in the Java docs at spark.apache.org, not the Scala docs. From what I know of Scala, the two languages can intermingle, and I had assumed Spark itself was written in Java. I would appreciate some clarification on the exact relationship as it pertains to RDDs. Is it the case, as per this response, that we have an abstract Scala RDD reference whose underlying implementation is a concrete Java RDD, i.e.:

# Scala abstract RDD = concrete Java MapPartitionsRDD
org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7]
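For reference, the concrete classes behind these abstract references can be checked directly in spark-shell (a minimal sketch reusing the lines and temps_rdd0 values defined above; the results match the class names already shown in the transcript output):

// The static type is the abstract RDD[T]; getClass reveals the concrete subclass
lines.getClass.getName        // org.apache.spark.rdd.MapPartitionsRDD
temps_rdd0.getClass.getName   // org.apache.spark.rdd.ParallelCollectionRDD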
Thanks in advance for your help/explanation.
MapPartitionsRDD is a subclass of RDD: github.com/apache/spark/blob/master/core/src/main/scala/org/… - Archeg
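Following up on that pointer: the subclass relationship can be confirmed from spark-shell by walking the superclass chain of the concrete object (a small sketch reusing the lines RDD from the question; the expected output is noted in comments):

// Print the inheritance chain of the concrete RDD behind `lines`
var cls: Class[_] = lines.getClass
while (cls != null) {
  println(cls.getName)
  cls = cls.getSuperclass
}
// org.apache.spark.rdd.MapPartitionsRDD
// org.apache.spark.rdd.RDD
// java.lang.Object

So MapPartitionsRDD is a Scala class defined in Spark's own source (the linked file), not a separate Java implementation; as far as I can tell it is declared private[spark], which would explain why it shows up in the generated Java docs but not in the public Scaladocs.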