
Some sources, like the Keynote: Spark 2.0 talk by Matei Zaharia, mention that Spark DataFrames are built on top of RDDs. I have found some mentions of RDDs in the DataFrame class (in Spark 2.0 I'd have to look at Dataset instead), but I still have a very limited understanding of how these two APIs are bound together behind the scenes.

Can someone explain how DataFrames extend RDDs, if they do?

I highly recommend you read these slides. – Umberto Griffo
@Umberto, which particular slides do you refer to? – shapiy
From slide 60, "Dataframe: Under The Hood". – Umberto Griffo
@Umberto, thanks! That explains everything very clearly. – shapiy
@UmbertoGriffo Hi! :( The slides were deleted... Any other link to them? Thank you. – Farah

1 Answer


According to the Databricks article Deep Dive into Spark SQL's Catalyst Optimizer (see the section "Using Catalyst in Spark SQL"), RDDs are elements of the physical plan built by Catalyst. In other words, we describe queries in terms of DataFrames, but in the end Spark operates on RDDs.

[Diagram: Catalyst workflow]
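
To see this binding directly in code, every DataFrame (a Dataset[Row] in Spark 2.0) exposes the RDD it compiles down to, as well as its Catalyst plans. A minimal sketch, assuming Spark 2.x running locally; the sample data and column names here are made up for illustration:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("dataframe-over-rdd")
  .master("local[*]")
  .getOrCreate()

// Illustrative DataFrame with two columns of a made-up auction schema.
val df = spark.createDataFrame(Seq(
  ("a1", 10.0),
  ("a2", 20.0)
)).toDF("auctionid", "price")

// Every DataFrame can hand back the RDD[Row] its physical plan runs on.
val rows: RDD[Row] = df.rdd

// queryExecution exposes the Catalyst plans; the leaves of the
// physical plan are RDD-backed operators.
println(df.queryExecution.executedPlan)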

Also, you can view the physical plan of your query by calling the explain() method.

// Prints the physical plan to the console for debugging purposes
auction.select("auctionid").distinct.explain()

// == Physical Plan ==
// Distinct false
//  Exchange (HashPartitioning [auctionid#0], 200)
//   Distinct true
//    Project [auctionid#0]
//     PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
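
Note the PhysicalRDD leaf at the bottom of the plan: the query ultimately executes over a MapPartitionsRDD. As a rough cross-check (assuming the same auction DataFrame as above), you can also pull the RDD out of the query and print its lineage:

// .rdd materializes the query as an RDD[Row]; toDebugString
// prints the RDD lineage that backs the DataFrame query.
val distinctIds = auction.select("auctionid").distinct.rdd
println(distinctIds.toDebugString)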