Spark SQL: how does it map to RDD operations?

Question

While learning Spark SQL, a question came to mind:

The result of executing SQL is said to be a SchemaRDD, but what happens behind the scenes? How many transformations or actions does the optimized execution plan invoke, and is it equivalent to hand-written plain RDD code?

If we write the code by hand instead of using SQL, it may generate some intermediate RDDs, e.g. from a series of map() and filter() operations on the source RDD. But the SQL version would not generate intermediate RDDs, correct?

Depending on the SQL, the generated JVM bytecode also involves partitioning and shuffling, correct? But without intermediate RDDs, how could Spark schedule and execute it on the worker machines?

In fact, I still cannot understand the relationship between Spark SQL and Spark core. How do they interact with each other?

Sim Sim · Accepted Answer · 2016-06-12T20:03:41

To understand how Spark SQL or the DataFrame/Dataset DSL maps to RDD operations, look at the physical plan Spark generates, using explain:

sql(/* your SQL here */).explain
myDataframe.explain
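
As a minimal sketch of what that looks like on a concrete query, assuming Spark 2.x or later running in local mode; the `people` view, its columns, and the sample rows are made up purely for illustration:

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: Spark 2.x+ on the classpath)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data registered as a temporary view
    Seq(("alice", 34), ("bob", 19), ("carol", 34))
      .toDF("name", "age")
      .createOrReplaceTempView("people")

    val df = spark.sql(
      "SELECT age, count(*) AS n FROM people WHERE age > 20 GROUP BY age")

    // explain(true) prints the parsed, analyzed and optimized logical plans
    // followed by the physical plan; plain explain prints only the physical plan
    df.explain(true)

    spark.stop()
  }
}
```

For a query like this, the GROUP BY usually appears in the physical plan as an Exchange (a shuffle) between two aggregation steps, which is where the partitioning and shuffling asked about in the question become visible. Exact operator names vary between Spark versions.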

At the very core of Spark, RDD[_] is the underlying datatype that is manipulated using distributed operations. In Spark versions <= 1.6.x a DataFrame is essentially an RDD[Row] with a schema, and Dataset is a separate API. In Spark versions >= 2.x DataFrame becomes an alias for Dataset[Row]. That doesn't change the fact that underneath it all Spark uses RDD operations, so a SQL query still produces RDDs that Spark core schedules like any others.
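
One way to see the RDDs behind a query is to print the lineage of the DataFrame's `rdd` and compare it with a hand-written version. The following is a rough sketch, again assuming Spark 2.x or later in local mode; the data and the hand-written pipeline are illustrative and are not what Spark generates internally:

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lineage-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val people = Seq(("alice", 34), ("bob", 19), ("carol", 34))

    // DataFrame version: filter, then count rows per age
    val df = people.toDF("name", "age")
      .filter($"age" > 20)
      .groupBy($"age")
      .count()

    // df.rdd exposes the query as an RDD[Row]; toDebugString prints
    // the chain of RDDs (including any shuffle) that backs it
    println(df.rdd.toDebugString)

    // Hand-written RDD pipeline computing the same counts, for comparison
    val byHand = spark.sparkContext.parallelize(people)
      .filter { case (_, age) => age > 20 }
      .map { case (_, age) => (age, 1L) }
      .reduceByKey(_ + _)
    println(byHand.toDebugString)

    spark.stop()
  }
}
```

Both lineages should contain a shuffle stage for the aggregation. The DataFrame version tends to show fewer, coarser RDDs, because Spark SQL can fuse several operators into one generated stage rather than creating one RDD per map() or filter().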

For a deeper dive into understanding Spark execution, read Understanding Spark Through Visualization.
