Which one is faster: Spark SQL with a WHERE clause, or applying a filter on the DataFrame after Spark SQL?
Like select col1, col2 from tab1 where col1 = val;
Or
val df = sqlContext.sql("select col1, col2 from tab1")
df.filter("col1 = val")
Using the explain method to inspect the physical plan is a good way to determine performance.
For example, using the bank table from the Zeppelin Tutorial notebook:
sqlContext.sql("select age, job from bank").filter("age = 30").explain
And
sqlContext.sql("select age, job from bank where age = 30").explain
produce exactly the same physical plan:
== Physical Plan ==
Project [age#5,job#6]
+- Filter (age#5 = 30)
+- Scan ExistingRDD[age#5,job#6,marital#7,education#8,balance#9]
So the performance should be the same.
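For completeness, here is a minimal self-contained sketch you can run as a standalone Scala app to print both physical plans side by side. It assumes Spark 1.6-style APIs; the bank data below is made up to stand in for the tutorial's table.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("plan-compare").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Stand-in for the tutorial's bank table (illustrative rows, not from the post).
Seq((30, "teacher"), (45, "engineer")).toDF("age", "job").registerTempTable("bank")

val withWhere  = sqlContext.sql("select age, job from bank where age = 30")
val withFilter = sqlContext.sql("select age, job from bank").filter("age = 30")

// Catalyst optimizes both queries to the same plan shape; only the
// expression IDs (age#5 vs. age#12, etc.) may differ between the two.
println(withWhere.queryExecution.executedPlan)
println(withFilter.queryExecution.executedPlan)

Both println calls should show the same Filter-over-Scan structure as the plan above, since the optimizer rewrites both queries before execution.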
Though I think select age, job from bank where age = 30 is more readable in this case.