0 votes

We need to work on a big dataset with partitioned data, for efficiency reasons. The data source resides in Hive, but uses different partitioning criteria. In other words, we need to retrieve the data from Hive into Spark and re-partition it in Spark.

But there is an issue in Spark: the partitioning is reordered/redistributed when the data is persisted (either to Parquet or ORC). As a result, our new partitioning in Spark is lost.

As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (on read)?
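
To make the setup concrete, here is a minimal sketch of the workflow we have in mind, in Scala; the table name source_db.events and the partition column customer_id are hypothetical placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("repartition-from-hive")
      .enableHiveSupport() // required to read Hive tables
      .getOrCreate()

    // Read the Hive table, which is partitioned by its original criteria.
    val df = spark.table("source_db.events")

    // Re-partition in Spark by the new criteria.
    val repartitioned = df.repartition(df("customer_id"))

    // Persist partitioned by the new column; this is the step where,
    // as described above, the partitioning we built in Spark is lost.
    repartitioned.write
      .partitionBy("customer_id")
      .parquet("/tmp/events_by_customer")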

1
So, did you look? I think this can help you. - thebluephantom
I am afraid this approach is not consistent with this: "In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple Hive partitions.", according to this other answer - peleitor
Well, maybe you should try it yourself? Did you? - thebluephantom
spark.apache.org/docs/latest/…. Next week I will retry this in case something has changed. - thebluephantom
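
(For reference, one way to check the comment's claim that Hive and Spark partitions do not map 1:1, again using the hypothetical source_db.events table:)

    // `spark` is an ambient SparkSession with Hive support (e.g. spark-shell).
    val hivePartitions  = spark.sql("SHOW PARTITIONS source_db.events").count()
    val sparkPartitions = spark.table("source_db.events").rdd.getNumPartitions
    // The two counts generally differ: Spark sizes its input partitions by
    // file splits, not by Hive partition boundaries.
    println(s"Hive partitions: $hivePartitions, Spark partitions: $sparkPartitions")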

1 Answer

1 vote

Partition Discovery might be what you are looking for:

" Passing the path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. "