1 vote

I need to retrieve 10 million rows from Hive.

String sql = "select * from table_name";

List<Map<String, Object>> resultSet = jdbcTemplate.queryForList(sql);

The above runs fine when retrieving 1 million rows in a single hit with 2 GB of heap memory. It takes only 3-4 minutes to select all records from a 30 MB table (1 million rows).

But for more than 1 million records, I run into memory issues and it takes much longer.

I wanted to query Hive with OFFSET values, but Hive 1.2.1 does not seem to have an OFFSET clause.

Is there any other way to select records from Hive in batches, i.e. select the first 10K records, then the next 10K, and so on?

You do not care about performance? 1 million rows at once? 🙄🙄 - dm_tr
@dm_tr Yes, performance is important, but before that I need to be able to retrieve all the records without issues, even in the worst case. - saravanan
Simple programmers care about their code (they just need it to run). Good programmers care about data structures, performance and smoothness. Cheer up, captain - dm_tr

2 Answers

0 votes

If you have a primary-key candidate (it can be a list of columns) that can be used in an ORDER BY, then you can paginate with row_number():

select * -- or an explicit column list
from (
    select t.*, row_number() over (order by PK) as rn -- use the PK in the order by
    from table_name t
) s
where rn between 1000001 and 2000000

Check that your PK candidate is unique and not null: if the PK is not unique or can be null, row_number() behaves non-deterministically and may produce different results from run to run.

If you do not have a PK at all, this cannot be implemented reliably, because Hive may return rows in a different order on each run due to parallel execution; some rows would then be duplicated across batches and others lost.
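
For reference, a rough sketch of the driver loop on the JdbcTemplate side. This assumes an initialized JdbcTemplate named jdbcTemplate and a unique, non-null PK column named id; both names and the batch size are illustrative, not part of your schema:

int batchSize = 10000;
// total row count, so we know when to stop paging
long total = jdbcTemplate.queryForObject("select count(*) from table_name", Long.class);
for (long start = 1; start <= total; start += batchSize) {
    long end = start + batchSize - 1;
    String sql = "select * from ("
               + "  select t.*, row_number() over (order by id) as rn"
               + "  from table_name t"
               + ") s where rn between " + start + " and " + end;
    List<Map<String, Object>> batch = jdbcTemplate.queryForList(sql);
    // process this batch (write it out, aggregate, etc.) before fetching the next,
    // so only one batch is ever held in memory
}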

0 votes

This solution should work if you have some kind of unique, increasing id for each row.

Get the min and max values of the unique id, then select from the min upward in batches of 10,000 until you reach the max.

The same logic can be applied to timestamp fields: start at the beginning of the day and work through to the end of the day in batches of 15 minutes, 1 hour, or whatever fits your requirement. This works as long as the data is not extremely skewed.
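
A minimal sketch of that id-range loop, again assuming an initialized JdbcTemplate named jdbcTemplate and a numeric, increasing column named id (illustrative names only):

// find the id range once up front
long minId = jdbcTemplate.queryForObject("select min(id) from table_name", Long.class);
long maxId = jdbcTemplate.queryForObject("select max(id) from table_name", Long.class);
int batchSize = 10000;
for (long lo = minId; lo <= maxId; lo += batchSize) {
    long hi = Math.min(lo + batchSize - 1, maxId);
    List<Map<String, Object>> batch = jdbcTemplate.queryForList(
            "select * from table_name where id between " + lo + " and " + hi);
    // handle the batch here instead of accumulating everything in memory
}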

If you want to do it with JdbcTemplate, you can pass a RowCallbackHandler or a ResultSetExtractor as an argument to query(), so rows are processed as they stream in rather than collected into one big list.

To set the number of rows fetched at once by the ResultSet, set the fetch size on the JdbcTemplate (JdbcTemplate.setFetchSize(), which it applies to the statement via Statement.setFetchSize() in applyStatementSettings()).
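
For example, a sketch of streaming with a RowCallbackHandler and a fetch size; the fetch size of 10000 is just an example value (imports: java.sql.ResultSet, java.sql.SQLException, org.springframework.jdbc.core.RowCallbackHandler):

// hint to the driver how many rows to fetch per round trip
jdbcTemplate.setFetchSize(10000);
jdbcTemplate.query("select * from table_name", new RowCallbackHandler() {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        // called once per row; nothing is accumulated in memory
        // e.g. String firstColumn = rs.getString(1);
    }
});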

You can read more about it here: https://docs.spring.io/spring-framework/docs/3.1.x/javadoc-api/org/springframework/jdbc/core/JdbcTemplate.html