0
votes
  1. Primary question: Does the CBO (cost-based optimizer) in Spark apply only to Spark SQL, or is it also applied to the DataFrame and Dataset APIs?
  2. What is the difference between the CBO, Tungsten, and the Catalyst optimizer?
  3. Does knowing these three topics in detail help in improving performance? Can we actually take control and tune the internals? If yes, please explain how and share some reference links. (I find many articles explaining the concepts, but they seldom explain how to take advantage of this information to improve performance.)

1 Answer

0
votes
  1. Yes, it also works for the DataFrame/Dataset API.

  2. Prior to Spark 2.2, Catalyst was a purely rule-based optimizer. Enabling CBO allows Catalyst to use actual table and column statistics (metadata) to decide which physical plan to choose. So I would argue that CBO is part of Catalyst.

  3. No, you cannot "tune" the optimizer directly. What you can do is make use of CBO by gathering statistics for your input tables (ANALYZE TABLE ...). With those statistics available, Catalyst will produce better physical plans.
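
To make point 3 concrete, here is a rough sketch of what "gathering statistics" could look like in Spark SQL. The table names `sales` and `customers` and their columns are hypothetical, made up only for illustration. As far as I know, CBO is off by default and has to be switched on with `spark.sql.cbo.enabled`; please check the defaults for your Spark version.

```sql
-- Assumption: `sales` and `customers` are hypothetical tables registered in the catalog.

-- Turn on the cost-based optimizer (believed to be disabled by default).
SET spark.sql.cbo.enabled=true;
-- Optionally let CBO reorder multi-way joins based on the statistics.
SET spark.sql.cbo.joinReorder.enabled=true;

-- Collect table-level statistics (row count, size in bytes).
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE customers COMPUTE STATISTICS;

-- Collect column-level statistics for the columns used in joins and filters.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount;
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS customer_id;

-- Inspect what was collected (look for the "Statistics" row).
DESCRIBE EXTENDED sales;

-- Show the optimized logical plan together with the estimated sizes and row counts.
EXPLAIN COST
SELECT c.customer_id, SUM(s.amount)
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
WHERE s.amount > 100
GROUP BY c.customer_id;
```

Since the DataFrame/Dataset API goes through the same Catalyst planner, a query written against these tables with that API should benefit from the same statistics once they have been collected.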