Trying to understand how Hive partitions relate to Spark partitions, culminating in a question about joins.
I have 2 external Hive tables, both backed by S3 buckets and partitioned by date, so in each bucket the keys have the name format date=<yyyy-MM-dd>/<filename>.
Question 1:
If I read this data into Spark:
val table1 = spark.table("table1").as[Table1Row]
val table2 = spark.table("table2").as[Table2Row]
then how many partitions will the resultant datasets have, respectively? A number of partitions equal to the number of objects in S3?
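For reference, this is how I have been checking the partition count so far (a minimal sketch using the datasets defined above):

// Check how many partitions Spark actually created when reading each table
println(table1.rdd.getNumPartitions)  // partitions of the dataset read from table1
println(table2.rdd.getNumPartitions)  // partitions of the dataset read from table2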
Question 2:
Suppose the two row types have the following schema:
Table1Row(date: Date, id: String, ...)
Table2Row(date: Date, id: String, ...)
and that I want to join table1 and table2 on the fields date and id:
table1.joinWith(table2,
table1("date") === table2("date") &&
table1("id") === table2("id")
)
Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join? And if so, how?
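The only way I have found to investigate this myself is to look at the physical plan, e.g. (a minimal sketch; it is the explain output that I am unsure how to interpret):

// Inspect the physical plan to see whether the join introduces an
// Exchange (shuffle) and whether any partition filters are pushed down
val joined = table1.joinWith(table2,
  table1("date") === table2("date") &&
  table1("id") === table2("id")
)
joined.explain(true)  // look for Exchange / PartitionFilters in the output

// Filtering on the partition column alone, which I would expect to show
// partition pruning against the Hive/S3 partitions (date value is arbitrary)
table1.filter(table1("date") === "2020-01-01").explain(true)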
Question 3:
Suppose now that I am using RDDs instead:
val rdd1 = table1.rdd
val rdd2 = table2.rdd
AFAIK, the syntax for the join using the RDD API would look something like:
rdd1.map(row1 => ((row1.date, row1.id), row1))
  .join(rdd2.map(row2 => ((row2.date, row2.id), row2)))
Again, is Spark going to be able to utilize the fact that the partition key in the Hive tables is being used in the join?
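In case it helps frame the question: the only way I know to avoid re-shuffling in the RDD join is to pre-partition both keyed RDDs with the same partitioner myself (a sketch of what I mean, where the partition count of 200 is an arbitrary choice, not something derived from the Hive layout):

import org.apache.spark.HashPartitioner

// Pre-partition both keyed RDDs with the same partitioner so that the
// subsequent join can run on co-partitioned data without another shuffle
val partitioner = new HashPartitioner(200)

val keyed1 = rdd1.map(row1 => ((row1.date, row1.id), row1)).partitionBy(partitioner)
val keyed2 = rdd2.map(row2 => ((row2.date, row2.id), row2)).partitionBy(partitioner)

val joined = keyed1.join(keyed2)  // co-partitioned, so no additional shuffle here

What I do not know is whether Spark can derive anything like this automatically from the Hive partitioning, which is the heart of the question.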