6
votes

Which one is faster: Spark SQL with a WHERE clause, or applying filter on the DataFrame after Spark SQL?

Like select col1, col2 from tab1 where col1 = val;

Or

DataFrame df = sqlContext.sql("select col1, col2 from tab1");

df.filter("col1 = val");


1 Answer

20
votes

Using the explain method to inspect the physical plan is a good way to compare performance.

For example, using the bank table from the Zeppelin Tutorial notebook, these two queries:

sqlContext.sql("select age, job from bank").filter("age = 30").explain

And

sqlContext.sql("select age, job from bank where age = 30").explain

produce exactly the same physical plan:

== Physical Plan ==
Project [age#5,job#6]
+- Filter (age#5 = 30)
   +- Scan ExistingRDD[age#5,job#6,marital#7,education#8,balance#9]

So the performance should be the same.
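The same check also works if the query is written entirely with the DataFrame API instead of SQL. A minimal sketch, assuming a Spark 1.x SQLContext with the tutorial's bank table registered (as in the examples above); Catalyst optimizes this to the same Project/Filter/Scan plan:

```scala
// Pure DataFrame version of the same query. The "bank" table name and
// sqlContext come from the Zeppelin tutorial setup assumed above.
val df = sqlContext.table("bank")
  .select("age", "job")
  .filter("age = 30")

// Prints the same physical plan as the two SQL variants,
// confirming the filter is applied in the same place either way.
df.explain()
```

Comparing the explain output of all three forms is the reliable way to confirm the optimizer treats them identically.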

Though I think select age, job from bank where age = 30 is more readable in this case.