0 votes

We need to work on a big dataset with partitioned data, for efficiency reasons. The data source resides in Hive, but uses different partitioning criteria. In other words, we need to retrieve the data from Hive into Spark and re-partition it in Spark.

But there is an issue in Spark: the partitioning is reordered/redistributed when the data is persisted (either to Parquet or ORC). As a result, our new partitioning in Spark is lost.

As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (on read)?
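
To make the setup concrete, here is a minimal sketch of the workflow we have in mind, in Scala; the table name source_db.events and the partition column customer_id are hypothetical placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("repartition-from-hive")
      .enableHiveSupport() // required to read Hive tables
      .getOrCreate()

    // Read the Hive table, which is partitioned by its original criteria.
    val df = spark.table("source_db.events")

    // Re-partition in Spark by the new criteria.
    val repartitioned = df.repartition(df("customer_id"))

    // Persist partitioned by the new column; this is the step where,
    // as described above, the partitioning we built in Spark is lost.
    repartitioned.write
      .partitionBy("customer_id")
      .parquet("/tmp/events_by_customer")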

1
So, did you look? I think this can help you. - thebluephantom
I am afraid this approach is not consistent with this: "In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple Hive partitions.", according to this other answer - peleitor
Well, maybe you should try it yourself? Did you? - thebluephantom
spark.apache.org/docs/latest/…. Next week I will retry this in case something has changed. - thebluephantom
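
(For reference, one way to check the comment's claim that Hive and Spark partitions do not map 1:1, again using the hypothetical source_db.events table:)

    // `spark` is an ambient SparkSession with Hive support (e.g. spark-shell).
    val hivePartitions  = spark.sql("SHOW PARTITIONS source_db.events").count()
    val sparkPartitions = spark.table("source_db.events").rdd.getNumPartitions
    // The two counts generally differ: Spark sizes its input partitions by
    // file splits, not by Hive partition boundaries.
    println(s"Hive partitions: $hivePartitions, Spark partitions: $sparkPartitions")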

1 Answer

1 vote

Partition Discovery might be what you are looking for:

" Passing the path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. "