I have a query in Spark SQL that uses a broadcast join as expected, since my table b is smaller than spark.sql.autoBroadcastJoinThreshold.
However, if I put the exact same select query into a CTAS statement, it does NOT use a broadcast join for some reason.
The select query looks like this:
select id,name from a join b on a.name = b.bname;
And the explain for it looks like this:
== Physical Plan ==
Project [id#1,name#2]
+- BroadcastHashJoin [name#2], [bname#3], BuildRight
:- Scan ParquetRelation: default.a[id#1,name#2] InputPaths: ...
+- ConvertToUnsafe
+- HiveTableScan [bname#3], MetastoreRelation default, b, Some(b)
Then my CTAS looks like this:
create table c as select id,name from a join b on a.name = b.bname;
And the explain for this one returns:
== Physical Plan ==
ExecutedCommand CreateTableAsSelect [Database:default}, TableName: c, InsertIntoHiveTable]
+- Project [id#1,name#2]
+- Join Inner, Some((name#2 = bname#3))
:- Relation[id#1,name#2] ParquetRelation: default.a
+- MetastoreRelation default, b, Some(b)
Is it expected that the select inside a CTAS does NOT use a broadcast join? Either way, is there a way to force the CTAS to use one?
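For reference, one workaround I am considering is to bypass the SQL CTAS entirely and do the write through the DataFrame API, using the broadcast hint explicitly. This is only a sketch, assuming the table and column names from above and a SQLContext named sqlContext; I have not confirmed it produces the same table layout as the CTAS:

```scala
import org.apache.spark.sql.functions.broadcast

// Load both sides of the join as DataFrames.
val a = sqlContext.table("a")
val b = sqlContext.table("b")

// Explicitly mark b as the broadcast side, mirroring the BuildRight
// side of the plain select's physical plan.
val joined = a.join(broadcast(b), a("name") === b("bname"))
  .select(a("id"), a("name"))

// Materialize the result as table c instead of using CTAS.
joined.write.saveAsTable("c")
```

But I would prefer to keep the CTAS form if the broadcast join can be made to kick in there.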