
I am currently working on a Java MapReduce job that should write its output to a bucketed Hive table.

I can think of two approaches:

First, write directly to Hive via HCatalog. The problem is that HCatalog does not support writing to bucketed Hive tables. So with a bucketed table I would first have to write to a non-bucketed table and then copy the data over to the bucketed one.
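For illustration, a minimal sketch of the HCatalog route (writing to a non-bucketed table), assuming a hypothetical table mydb.my_unbucketed_table and omitting the mapper/reducer classes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.data.DefaultHCatRecord;
    import org.apache.hive.hcatalog.data.schema.HCatSchema;
    import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

    public class HCatWriteDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "write-to-hive");
            job.setJarByClass(HCatWriteDriver.class);

            // Point the job at the (non-bucketed) target table;
            // null = no static partition values.
            HCatOutputFormat.setOutput(job,
                    OutputJobInfo.create("mydb", "my_unbucketed_table", null));

            // Reuse the table's own schema for the records we emit.
            HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
            HCatOutputFormat.setSchema(job, schema);

            job.setOutputFormatClass(HCatOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(DefaultHCatRecord.class);

            // ... set mapper/reducer classes that emit DefaultHCatRecord values ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }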

The second option is to write the output to a text file and load this data into Hive afterwards.
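And a minimal sketch of the text-file route; the staging path and table name are placeholders, and the LOAD DATA statement in the trailing comment would be run in Hive afterwards:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class TextStagingDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "stage-as-text");
            job.setJarByClass(TextStagingDriver.class);

            // Emit plain delimited text matching the target table's ROW FORMAT.
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("/tmp/staging/my_table"));

            // ... set mapper/reducer classes ...
            // Afterwards, in Hive:
            //   LOAD DATA INPATH '/tmp/staging/my_table' INTO TABLE my_table;
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }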

What is the best practice here?

Which approach performs better with huge amounts of data, in terms of memory use and running time?

Which approach would be better if I could also use non-bucketed Hive tables?

Thanks a lot!

Not sure if I understood correctly. If the goal is to create bucketed output, use MapReduce's multiple-output facility to create the buckets. Or directly load the data into the Hive bucketed table, which will create the buckets internally. – Rahul Sharma
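A minimal sketch of what the comment suggests, using Hadoop's MultipleOutputs to split reducer output into per-bucket files (bucket count and file naming are illustrative; Hive expects its own bucket file naming, so treat this as the mechanics only):

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class BucketReducer extends Reducer<Text, Text, NullWritable, Text> {
        private static final int NUM_BUCKETS = 4; // must match the table definition
        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Route each record to a bucket file based on the key's hash,
            // mirroring Hive's (hash & Integer.MAX_VALUE) % numBuckets scheme.
            int bucket = (key.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
            for (Text value : values) {
                mos.write(NullWritable.get(), value, "bucket" + bucket);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }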

1 Answer


For non-bucketed tables, you can write your MapReduce output directly to the table's storage location. Then you'd only need to run MSCK REPAIR TABLE so the metastore picks up the new partitions.

Hive's LOAD DATA command actually just copies (for local files) or moves (for HDFS files) the data to the table's storage location; it does not transform anything.
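As a sketch of that flow: after the MapReduce job has written its files into new partition directories under the table's location, the repair can be issued over Hive JDBC (the driver class is real, but the connection URL, credentials, and table name below are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MsckRepair {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/mydb", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Register the newly written partition directories in the metastore.
                stmt.execute("MSCK REPAIR TABLE my_table");
            }
        }
    }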

Also, from the Hive documentation:

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.

So you'd need to tweak your MapReduce job to fit these constraints.
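A minimal sketch of those constraints on the MapReduce side, assuming a hypothetical table clustered into 4 buckets on the reduce key (note that Hive's hash function can differ from Java's hashCode for some types, so this shows the mechanics rather than a drop-in solution):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class BucketedOutputDriver {
        // Partition on the bucketing column so each reducer produces
        // exactly one bucket, as the documentation above requires.
        public static class BucketPartitioner extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text key, Text value, int numPartitions) {
                return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "bucketed-output");
            job.setJarByClass(BucketedOutputDriver.class);

            job.setPartitionerClass(BucketPartitioner.class);
            job.setNumReduceTasks(4); // must equal the table's bucket count

            // ... set input/output formats and mapper/reducer classes ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }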