
The current raw data is on Hive. I want to join several partitioned Hive tables, each several terabytes in size, and then output the result as a partitioned Hive table in Parquet format.

I am considering loading all partitions of the Hive tables as Spark DataFrames, and then doing the joins, group-bys, and so on. Is this the right way to do it?

Finally, I will need to save the data. Can a Spark DataFrame be saved as a dynamically partitioned Hive table in Parquet format? How should the metadata be handled?
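
A minimal sketch of the flow I have in mind, in PySpark (the table names db.orders and db.customers, the join key customer_id, and the partition column dt are all made up for illustration):

```python
from pyspark.sql import SparkSession

# Hive support is required to read and write Hive tables from Spark
spark = (SparkSession.builder
         .appName("hive-join")
         .enableHiveSupport()
         .getOrCreate())

# Load the partitioned Hive tables as DataFrames; partition pruning
# happens automatically when filtering on the partition column
orders = spark.table("db.orders")
customers = spark.table("db.customers")

# Join and aggregate
result = (orders
          .join(customers, "customer_id")
          .groupBy("customer_id", "dt")
          .count())

# Write the result back as a partitioned Hive table in Parquet format;
# Spark creates one partition per distinct dt value
(result.write
 .format("parquet")
 .partitionBy("dt")
 .mode("overwrite")
 .saveAsTable("db.result_table"))
```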


1 Answer

  • If one of the data sets is sufficiently smaller than the others, you may want to consider using a broadcast join, so the small table is shipped to every executor instead of shuffling the large one (see the first sketch after this list).

  • Depending on the nature of the data, you could try aggregating (group by) before the join, so each machine only needs to process a specific subset of the data, reducing the amount of data shuffled during the job (see the second sketch after this list).

  • Hive supports storing data in Parquet format directly: https://cwiki.apache.org/confluence/display/Hive/Parquet. Have you given it a try? (See the third sketch after this list.)
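
A minimal sketch of the broadcast join from the first point, with toy stand-in DataFrames (all names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

# Toy stand-ins for a large fact table and a small dimension table
big_df = spark.createDataFrame(
    [(1, 10.0), (2, 20.0), (1, 5.0)], ["customer_id", "amount"])
small_df = spark.createDataFrame(
    [(1, "US"), (2, "DE")], ["customer_id", "country"])

# broadcast() hints Spark to ship the small table to every executor,
# so the large table is joined locally without a shuffle
joined = big_df.join(broadcast(small_df), "customer_id")
joined.show()
```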
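
And a sketch of the second point: aggregating before the join means the join shuffle moves one row per key instead of every raw row (again with toy data and hypothetical column names):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregate-then-join").getOrCreate()

orders = spark.createDataFrame(
    [(1, 10.0), (1, 5.0), (2, 20.0)], ["customer_id", "amount"])
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"])

# Aggregate first so the subsequent join only shuffles one row per key
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))
result = totals.join(customers, "customer_id")
result.show()
```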
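
For the third point, a sketch of creating a Parquet-backed partitioned Hive table and filling it with dynamic partitioning, issued through Spark SQL (the table db.result_table, the source result_view, and all columns are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-hive-table")
         .enableHiveSupport()
         .getOrCreate())

# Create a Parquet-backed, partitioned Hive table
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.result_table (
        customer_id BIGINT,
        total DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")

# Allow Hive to derive partition values from the data being inserted
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# The partition column must come last in the SELECT list
spark.sql("""
    INSERT OVERWRITE TABLE db.result_table PARTITION (dt)
    SELECT customer_id, total, dt FROM result_view
""")
```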