
Some sources, like the Keynote: Spark 2.0 talk by Matei Zaharia, mention that Spark DataFrames are built on top of RDDs. I have found some mentions of RDDs in the DataFrame class (in Spark 2.0 I'd have to look at Dataset instead), but I still have a very limited understanding of how these two APIs are bound together behind the scenes.

Can someone explain how DataFrames extend RDDs, if they do?

I highly recommend you read these slides. – Umberto Griffo
@Umberto, which particular slides do you refer to? – shapiy
From slide 60, "Dataframe: Under The Hood". – Umberto Griffo
@Umberto, thanks! That explains everything very clearly. – shapiy
@UmbertoGriffo Hi! :( The slides were deleted... Any other link to them? Thank you. – Farah

1 Answer


According to the Databricks article Deep Dive into Spark SQL's Catalyst Optimizer (see the section "Using Catalyst in Spark SQL"), RDDs are elements of the physical plan built by Catalyst. In other words, we describe queries in terms of DataFrames, but in the end Spark operates on RDDs.

[Diagram: Catalyst workflow]
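
To see this binding directly in code, every DataFrame (a Dataset[Row] in Spark 2.0) exposes the RDD it compiles down to, as well as its Catalyst plans. A minimal sketch, assuming Spark 2.x running locally; the sample data and column names here are made up for illustration:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("dataframe-over-rdd")
  .master("local[*]")
  .getOrCreate()

// Illustrative DataFrame with two columns of a made-up auction schema.
val df = spark.createDataFrame(Seq(
  ("a1", 10.0),
  ("a2", 20.0)
)).toDF("auctionid", "price")

// Every DataFrame can hand back the RDD[Row] its physical plan runs on.
val rows: RDD[Row] = df.rdd

// queryExecution exposes the Catalyst plans; the leaves of the
// physical plan are RDD-backed operators.
println(df.queryExecution.executedPlan)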

Also, you can view the physical plan of your query by calling the explain() method.

// Prints the physical plan to the console for debugging purposes
auction.select("auctionid").distinct.explain()

// == Physical Plan ==
// Distinct false
//  Exchange (HashPartitioning [auctionid#0], 200)
//   Distinct true
//    Project [auctionid#0]
//     PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
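
Note the PhysicalRDD leaf at the bottom of the plan: the query ultimately executes over a MapPartitionsRDD. As a rough cross-check (assuming the same auction DataFrame as above), you can also pull the RDD out of the query and print its lineage:

// .rdd materializes the query as an RDD[Row]; toDebugString
// prints the RDD lineage that backs the DataFrame query.
val distinctIds = auction.select("auctionid").distinct.rdd
println(distinctIds.toDebugString)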