I am working on a Java MapReduce job that should write its output to a bucketed Hive table.
I can think of two approaches:
The first is to write directly to Hive via HCatalog. The problem is that this approach does not support writing to bucketed Hive tables, so I would have to write to a non-bucketed table first and then copy the data into the bucketed one.
The second option is to write the output to a text file and load this data into Hive afterwards.
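To make the copy step of the first approach concrete, here is a minimal sketch that only builds the HiveQL statements I would expect to run against Hive. The table names `events_staging` and `events_bucketed` and the class and method names are hypothetical, and as far as I know the `hive.enforce.bucketing` setting is only needed on Hive versions before 2.0.

```java
import java.util.Arrays;
import java.util.List;

public class BucketedCopySketch {

    // Builds the HiveQL statements that copy a non-bucketed staging table
    // into a bucketed target table. Both table names are supplied by the caller.
    static List<String> copyStatements(String stagingTable, String bucketedTable) {
        return Arrays.asList(
            // Assumption: only required on Hive versions before 2.0.
            "SET hive.enforce.bucketing=true",
            // The INSERT ... SELECT lets Hive distribute the rows into buckets.
            "INSERT OVERWRITE TABLE " + bucketedTable + " SELECT * FROM " + stagingTable
        );
    }

    public static void main(String[] args) {
        // Hypothetical table names, used for illustration only.
        for (String stmt : copyStatements("events_staging", "events_bucketed")) {
            System.out.println(stmt + ";");
        }
    }
}
```

The printed statements could then be run through the Hive CLI or a JDBC connection after the MapReduce job has finished writing to the staging table.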
What is the best practice here?
Which approach performs better on very large data sets, in terms of memory usage and run time?
Which approach would be better if I could use non-bucketed Hive tables instead?
Thanks a lot!