There are a lot of RDD classes in Spark; from the docs:
- AsyncRDDActions
- CoGroupedRDD
- DoubleRDDFunctions
- HadoopRDD
- JdbcRDD
- NewHadoopRDD
- OrderedRDDFunctions
- PairRDDFunctions
- PartitionPruningRDD
- RDD
- SequenceFileRDDFunctions
- ShuffledRDD
- UnionRDD
and I do not understand what they are supposed to be.
Additionally I noticed that there are ParallelCollectionRDD and MapPartitionsRDD, which are not listed even though objects of these classes show up very often in my spark-shell.
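For example, a minimal spark-shell session (a sketch; `sc` is the SparkContext provided by the shell, and the bracketed numbers are just RDD ids that may differ):

```scala
// Creating an RDD from a local collection prints as a ParallelCollectionRDD:
val nums = sc.parallelize(1 to 10)
// nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize ...

// A map over it prints as a MapPartitionsRDD:
val doubled = nums.map(_ * 2)
// doubled: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map ...
```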
Question
Why are there different RDDs and what are their respective purposes?
What I understood so far
I understood from tutorials and books (e.g. "Learning Spark") that there are two types of operations on RDDs: those for RDDs whose elements are pairs (x, y), and all the other operations. So I would expect there to be just a class RDD and a class PairRDD, and that's it.
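That expectation comes from examples like this (a sketch for the spark-shell; `sc` is the shell's SparkContext):

```scala
val words = sc.parallelize(Seq("a", "b", "a"))  // RDD[String]
val pairs = words.map(w => (w, 1))              // RDD[(String, Int)]

// reduceByKey is only available on RDDs of pairs:
val counts = pairs.reduceByKey(_ + _)
// words.reduceByKey(_ + _)  // does not compile: String has no key/value structure
```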
What I suspect
I suspect that I got it partly wrong and that what is actually the case is this: many of those RDD classes could be folded into one RDD class, but that would make things less tidy. So instead, the developers decided to put different groups of methods into different classes, and in order to make those methods available on any RDD of a suitable element type, they use implicit conversions to coerce between the class types. I suspect this because many of the class names end with "Functions" or "Actions", and the text in the respective scaladocs sounds like that.
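To make my suspicion concrete, here is a toy model of that pattern in plain Scala, without Spark (MiniRDD and MiniPairFunctions are my own made-up names, not Spark classes):

```scala
import scala.language.implicitConversions

// A bare "RDD" with no pair-specific methods of its own.
class MiniRDD[T](val data: Seq[T])

// The pair-only methods live in a separate class...
class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    self.data.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }
}

object MiniRDD {
  // ...and an implicit conversion attaches them to any MiniRDD of pairs.
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): MiniPairFunctions[K, V] =
    new MiniPairFunctions(rdd)
}

val pairs = new MiniRDD(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _)  // compiles because the implicit conversion kicks in
```

If Spark works the same way, then calling `reduceByKey` on an `RDD[(K, V)]` would really be calling it on a `PairRDDFunctions[K, V]` wrapper that an implicit conversion creates behind the scenes.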
Additionally I suspect that some of the RDD classes are still not like that, but have some deeper meaning (e.g. ShuffledRDD).
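What I mean by "deeper meaning": some of these names seem to denote concrete kinds of RDDs rather than bags of extra methods. For instance, in the spark-shell the lineage of a shuffle-producing operation mentions a ShuffledRDD (a sketch; the exact output format varies by Spark version):

```scala
val counts = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).reduceByKey(_ + _)

// toDebugString prints the lineage, which contains a ShuffledRDD stage,
// roughly like:
//   ShuffledRDD[1] at reduceByKey ...
//    +- ParallelCollectionRDD[0] at parallelize ...
println(counts.toDebugString)
```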
However, I am not sure about any of this.