There are a lot of RDD classes in Spark; from the docs:
- AsyncRDDActions
- CoGroupedRDD
- DoubleRDDFunctions
- HadoopRDD
- JdbcRDD
- NewHadoopRDD
- OrderedRDDFunctions
- PairRDDFunctions
- PartitionPruningRDD
- RDD
- SequenceFileRDDFunctions
- ShuffledRDD
- UnionRDD
and I do not understand what they are supposed to be.
Additionally I noticed that there are ParallelCollectionRDD and MapPartitionsRDD, which are not listed even though objects of these classes show up very often in my spark-shell.
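For example, a minimal spark-shell session (a sketch; `sc` is the SparkContext provided by the shell, and the bracketed numbers are just RDD ids that may differ):

```scala
// Creating an RDD from a local collection prints as a ParallelCollectionRDD:
val nums = sc.parallelize(1 to 10)
// nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize ...

// A map over it prints as a MapPartitionsRDD:
val doubled = nums.map(_ * 2)
// doubled: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map ...
```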
Question
Why are there different RDDs and what are their respective purposes?
What I understood so far
I understood from tutorials and books (e.g. "Learning Spark") that there are two types of operations on RDDs: those for RDDs whose elements are pairs (x, y), and all the other operations. So I would expect there to be just a class RDD and a class PairRDD, and that's it.
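That expectation comes from examples like this (a sketch for the spark-shell; `sc` is the shell's SparkContext):

```scala
val words = sc.parallelize(Seq("a", "b", "a"))  // RDD[String]
val pairs = words.map(w => (w, 1))              // RDD[(String, Int)]

// reduceByKey is only available on RDDs of pairs:
val counts = pairs.reduceByKey(_ + _)
// words.reduceByKey(_ + _)  // does not compile: String has no key/value structure
```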
What I suspect
I suspect that I got it partly wrong and that what is actually the case is this: many of those RDD classes could be folded into one RDD class, but that would make things less tidy. So instead, the developers decided to put different groups of methods into different classes, and in order to make those methods available on any RDD of a suitable element type, they use implicit conversions to coerce between the class types. I suspect this because many of the class names end with "Functions" or "Actions", and the text in the respective scaladocs sounds like that.
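To make my suspicion concrete, here is a toy model of that pattern in plain Scala, without Spark (MiniRDD and MiniPairFunctions are my own made-up names, not Spark classes):

```scala
import scala.language.implicitConversions

// A bare "RDD" with no pair-specific methods of its own.
class MiniRDD[T](val data: Seq[T])

// The pair-only methods live in a separate class...
class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    self.data.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }
}

object MiniRDD {
  // ...and an implicit conversion attaches them to any MiniRDD of pairs.
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): MiniPairFunctions[K, V] =
    new MiniPairFunctions(rdd)
}

val pairs = new MiniRDD(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _)  // compiles because the implicit conversion kicks in
```

If Spark works the same way, then calling `reduceByKey` on an `RDD[(K, V)]` would really be calling it on a `PairRDDFunctions[K, V]` wrapper that an implicit conversion creates behind the scenes.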
Additionally I suspect that some of the RDD classes are still not like that, but have some deeper meaning (e.g. ShuffledRDD).
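What I mean by "deeper meaning": some of these names seem to denote concrete kinds of RDDs rather than bags of extra methods. For instance, in the spark-shell the lineage of a shuffle-producing operation mentions a ShuffledRDD (a sketch; the exact output format varies by Spark version):

```scala
val counts = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).reduceByKey(_ + _)

// toDebugString prints the lineage, which contains a ShuffledRDD stage,
// roughly like:
//   ShuffledRDD[1] at reduceByKey ...
//    +- ParallelCollectionRDD[0] at parallelize ...
println(counts.toDebugString)
```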
However, I am not sure about any of this.