7
votes

Spark 2.2 introduced cost-based optimization (CBO, https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html), which makes use of table statistics (as computed by ANALYZE TABLE ... COMPUTE STATISTICS).

My question is: Are precomputed statistics also useful prior to Spark 2.2 (in my case 2.1) when operating on external Hive tables? Do the statistics influence the optimizer? If yes, can I also compute the statistics in Impala instead of Hive?
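For reference, the statistics in question are computed with statements like the following (a sketch; `my_db.my_table` is a placeholder name, and the Impala command is Impala's own equivalent rather than something Spark runs itself):

```sql
-- Hive / Spark SQL: compute table-level statistics (row count, size in bytes)
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS;

-- Impala's equivalent command, which writes stats into the Hive metastore
COMPUTE STATS my_db.my_table;
```

Since both commands store their results in the Hive metastore, the question is whether Spark 2.1's optimizer reads and uses them.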

UPDATE:

The only hint I have found so far is https://issues.apache.org/jira/browse/SPARK-15365

Apparently statistics are used to decide whether a broadcast join is done or not.

1
This is in fact a very interesting question, but unfortunately I don't think it's a fit for SO. It can be quite broad to answer and more or less opinion-based. But for the sake of discussion: how many times did you read on this site "Oh yes, my job is stuck at 90%! I don't understand why" (mainly a data skewness problem)? So what do you do in that case? You analyze your key data distributions and try to take decisions from there (e.g. partitioning or repartitioning). In the case of CBO, the statistics are less complex than that; it's mainly basic summary statistics. - eliasah
Thus, as you can read in the article, especially if you zoom in on that query performance benchmark graph, CBO seems to be beneficial on "big queries". Now to answer your question, and this is of course my personal opinion: yes, statistics can be useful prior to Spark 2.2. - eliasah
@eliasah I don't understand why this is opinion based? Whether Spark 2.1's optimizer will utilize the statistics or not should be easily answerable by someone who knows the internals of the optimizer. - Raphael Roth

1 Answer

0
votes

Apparently statistics are used to decide whether a broadcast join is done or not.

As you mentioned in the UPDATE, with cost-based optimization turned off the table statistics (computed using ANALYZE TABLE COMPUTE STATISTICS) are only used by the JoinSelection execution planning strategy, which chooses between the BroadcastHashJoinExec and BroadcastNestedLoopJoinExec physical operators.

JoinSelection uses the spark.sql.autoBroadcastJoinThreshold configuration property, which defaults to 10485760 bytes (10 MB).
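The size check itself is simple, and a minimal sketch can show why accurate statistics matter. The following is an illustrative model of the decision, not Spark's actual Catalyst code; the default size fallback for unanalyzed tables is an assumption about Spark's behavior (it uses a deliberately huge estimate so that tables without stats are never broadcast):

```python
# Simplified model of the JoinSelection broadcast check (not the actual
# Catalyst source): a join side qualifies for a broadcast hash join when its
# estimated size is at or below spark.sql.autoBroadcastJoinThreshold.
DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10485760 bytes

def can_broadcast(estimated_size_in_bytes: int,
                  threshold: int = DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD) -> bool:
    """Return True if a table with this size estimate would be broadcast.

    As in Spark, setting the threshold to -1 disables broadcast joins.
    """
    return threshold >= 0 and estimated_size_in_bytes <= threshold

# A table with accurate (small) statistics qualifies:
small_table = 5 * 1024 * 1024
# Without ANALYZE TABLE, the optimizer falls back to a very large default
# estimate, so the table never qualifies:
unanalyzed_table = 2**63 - 1

print(can_broadcast(small_table))       # broadcast hash join possible
print(can_broadcast(unanalyzed_table))  # falls back to a non-broadcast join
```

This is why running ANALYZE TABLE can change physical plans even on Spark 2.1: without stats the size estimate stays above the threshold and the cheaper broadcast strategy is never selected.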