1
votes

I have two tables that are clustered on the same columns, but when I join them on those clustered columns, the execution plan still shows both Exchange and Sort steps.

Both tables are bucketed on the same column (key_column) and stored as ORC. Table A is partitioned and bucketed; table B is only bucketed, on the same column.

I want to eliminate the Sort and Exchange steps from my plan; per the documentation, bucketed tables should avoid both.

I even tried setting the Hive properties below:

spark.sql('set spark.sql.orc.filterPushdown=true')
spark.sql('set hive.optimize.bucketmapjoin=true')
spark.sql('set hive.optimize.bucketmapjoin.sortedmerge=true')
spark.sql('set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat')
spark.sql('set hive.stats.autogather=true')
spark.sql('set hive.compute.query.using.stats=true')
spark.sql('set hive.optimize.index.filter=true')

I also collected statistics for the tables.
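For reference, a minimal sketch of how I gathered the stats in Spark SQL; table and column names here are placeholders for my real ones:

# ANALYZE TABLE gathers table-level and column-level statistics.
# table_a / table_b / key_column are placeholders.
spark.sql('ANALYZE TABLE table_a COMPUTE STATISTICS')
spark.sql('ANALYZE TABLE table_a COMPUTE STATISTICS FOR COLUMNS key_column')
spark.sql('ANALYZE TABLE table_b COMPUTE STATISTICS')
spark.sql('ANALYZE TABLE table_b COMPUTE STATISTICS FOR COLUMNS key_column')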

Both Sort and Exchange still appear in the physical plan, even though Hive bucketed tables should have avoided them.
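The plan below comes from running explain() on the join; a minimal sketch of that query, with table names standing in for my real tables:

# Sketch of the join whose physical plan is shown below;
# table_a / table_b are placeholders for the actual table names.
spark.sql('''
    SELECT count(*)
    FROM table_a a
    JOIN table_b b
      ON a.key_column = b.key_column
''').explain()

Actual result: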

 [count#1311L])
          +- *Project
             +- *SortMergeJoin [key_column#1079], [key_column#1218], Inner
                :- *Sort [key_column#1079 ASC NULLS FIRST], false, 0         <- sort step
                :  +- Exchange hashpartitioning(key_column#1079, 200)        <- exchange step
                :     +- *Filter isnotnull(key_column#1079)

Expected result: no Sort and no Exchange steps:

 [count#1311L])
          +- *Project
             +- *SortMergeJoin [key_column#1079], [key_column#1218], Inner
                :     +- *Filter isnotnull(key_column#1079)


Do you have the same number of buckets on each table? - Steven
Yes @Steven, we have the same number of buckets. - Bhaskar Sharma

1 Answer

0
votes

The bucketing semantics of Hive and Spark are different.
When a bucketed table created from Hive is read in Spark, Hive's bucketing semantics are not honoured.
To use Spark's bucketing feature, the tables have to be created with Spark.
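A minimal sketch of writing the tables with Spark's own bucketing and then joining them; the DataFrames, table names, and bucket count are placeholders, and depending on the Spark version a residual Sort can remain when a bucket spans multiple files:

# Write both sides with Spark bucketing (DataFrameWriter.bucketBy);
# matching bucket counts on the join key let the sort-merge join
# read pre-bucketed data and drop the Exchange.
df_a.write.bucketBy(200, 'key_column').sortBy('key_column') \
    .mode('overwrite').saveAsTable('table_a_spark')
df_b.write.bucketBy(200, 'key_column').sortBy('key_column') \
    .mode('overwrite').saveAsTable('table_b_spark')

# Bucketed reads are controlled by this flag (on by default).
spark.sql('set spark.sql.sources.bucketing.enabled=true')

spark.table('table_a_spark') \
    .join(spark.table('table_b_spark'), 'key_column') \
    .explain()  # the plan should no longer show Exchange under the join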

Open-source design doc elaborating the differences between Hive and Spark bucketing: https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit#heading=h.fbzz4lt51r0