1 vote

I have a huge Hive table that a MapReduce job fails to process because of insufficient Java heap size on a single-node local installation. I cannot increase the YARN heap size because the node lacks physical memory. As a workaround, I was thinking about splitting this huge table into several smaller ones of approximately equal size and with the same structure (schema) — say, 20,000,000 records into 5 tables with 4,000,000 records each.

What would be the SQL to split a Hive table this way?


2 Answers

3 votes

First of all, I would dig into why you're getting heap size errors. This usually indicates a misconfigured cluster. In theory, Hive/Hadoop should be able to do almost everything by streaming to/from disk; it loads data into memory mainly as an optimization. Hive is designed to handle tables with billions of records and terabytes of data.

However, if you do actually want to evenly sample into multiple tables, you can use a multi-table insert; something like this:

from (
  select a, b, c, floor(rand() * 5) as part from my_table
) t
insert into table my_table_0 select a, b, c where part = 0
insert into table my_table_1 select a, b, c where part = 1
insert into table my_table_2 select a, b, c where part = 2
insert into table my_table_3 select a, b, c where part = 3
insert into table my_table_4 select a, b, c where part = 4
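Note that the five target tables must exist before the multi-table insert runs. A minimal sketch, assuming the source table is named `my_table` (the target names are placeholders), using Hive's `CREATE TABLE ... LIKE` to copy the schema:

```sql
-- Create empty tables with the same schema as the source (names are hypothetical)
CREATE TABLE my_table_0 LIKE my_table;
CREATE TABLE my_table_1 LIKE my_table;
CREATE TABLE my_table_2 LIKE my_table;
CREATE TABLE my_table_3 LIKE my_table;
CREATE TABLE my_table_4 LIKE my_table;
```

`CREATE TABLE ... LIKE` copies column definitions and storage format but no data, so each table starts empty and ready to receive its share of rows.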
1 vote

There are a couple of ways to achieve this.

  1. Split your source file into 5 parts using the split command (more details: http://unixhelp.ed.ac.uk/CGI/man-cgi?split). Once it's split, load each file into a separate table.
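A minimal sketch of option 1, using a 20-line sample file as a stand-in for the exported table data (the file name and `part_` prefix are placeholders):

```shell
# Generate 20 sample "records" standing in for the exported table data
seq 20 > records.txt
# Split into files of 4 lines each: part_aa, part_ab, ..., part_ae
split -l 4 records.txt part_
# List the resulting chunks
ls part_*
```

For the real 20M-record table you would use `split -l 4000000`; each resulting file can then be loaded into its own table with `LOAD DATA LOCAL INPATH`.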

  2. If your Hive table has a date column or a sequence/ID column, you can filter on it to route records into the different tables.
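A sketch of option 2, assuming the table has a monotonically increasing `id` column (a hypothetical name; substitute your own date or sequence column) and target tables that already exist:

```sql
-- Route records into 5 tables by id range (column and table names are assumptions)
INSERT INTO TABLE my_table_0 SELECT * FROM my_table WHERE id >         0 AND id <=  4000000;
INSERT INTO TABLE my_table_1 SELECT * FROM my_table WHERE id >   4000000 AND id <=  8000000;
INSERT INTO TABLE my_table_2 SELECT * FROM my_table WHERE id >   8000000 AND id <= 12000000;
INSERT INTO TABLE my_table_3 SELECT * FROM my_table WHERE id >  12000000 AND id <= 16000000;
INSERT INTO TABLE my_table_4 SELECT * FROM my_table WHERE id >  16000000;
```

Unlike the rand()-based multi-table insert, this approach is deterministic, but each statement is a separate scan of the source table unless you combine them into one multi-table insert.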

Also, I would suggest looking at adding appropriate indexes (https://cwiki.apache.org/confluence/.../Hive/LanguageManual+Indexing) and/or partitions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL) to your tables to improve performance.