1 vote

I need to retrieve 10 million rows from Hive.

String sql = "select * from table_name";

List<Map<String, Object>> resultSet = jdbcTemplate.queryForList(sql);

The above runs fine when retrieving 1 million rows in a single hit with 2 GB of heap memory. It takes only 3-4 minutes to select all records from a 30 MB table (1 million rows).

But for more than 1 million records, I run into memory issues and it takes much longer.

I wanted to query Hive with OFFSET values, but Hive 1.2.1 does not seem to have an OFFSET clause.

Is there any other way to select records from Hive in batches, i.e. select the first 10K records, then the next 10K, and so on?

You do not care about performance? 1 million rows at once? 🙄🙄 - dm_tr
@dm_tr Yes, performance is important, but before that I need to be able to retrieve all the records without issues, even in the worst case. - saravanan
Simple programmers care about their code (they just need it to run). Good programmers care about data structures, performance and smoothness. Cheer up, captain - dm_tr

2 Answers

0 votes

If you have a primary-key candidate (it can be a list of columns) that can be used in an ORDER BY, then you can paginate with row_number():

select * -- or an explicit column list
from (
    select t.*, row_number() over (order by PK) as rn -- use the PK in the order by
    from table_name t
) s
where rn between 1000001 and 2000000

Check that your PK candidate is unique and not null: if the PK is not unique or can be null, row_number() behaves non-deterministically and may produce different results from run to run.

If you do not have a PK at all, this cannot be implemented reliably, because Hive may return rows in a different order on each run due to parallel execution; some rows would then be duplicated across batches and others lost.
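
For reference, a rough sketch of the driver loop on the JdbcTemplate side. This assumes an initialized JdbcTemplate named jdbcTemplate and a unique, non-null PK column named id; both names and the batch size are illustrative, not part of your schema:

int batchSize = 10000;
// total row count, so we know when to stop paging
long total = jdbcTemplate.queryForObject("select count(*) from table_name", Long.class);
for (long start = 1; start <= total; start += batchSize) {
    long end = start + batchSize - 1;
    String sql = "select * from ("
               + "  select t.*, row_number() over (order by id) as rn"
               + "  from table_name t"
               + ") s where rn between " + start + " and " + end;
    List<Map<String, Object>> batch = jdbcTemplate.queryForList(sql);
    // process this batch (write it out, aggregate, etc.) before fetching the next,
    // so only one batch is ever held in memory
}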

0 votes

This solution should work if you have some kind of unique, increasing id for each row.

Get the min and max values of the unique id, then select from the min upward in batches of 10,000 until you reach the max.

The same logic can be applied to timestamp fields: start at the beginning of the day and work through to the end of the day in batches of 15 minutes, 1 hour, or whatever fits your requirement. This works as long as the data is not extremely skewed.
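
A minimal sketch of that id-range loop, again assuming an initialized JdbcTemplate named jdbcTemplate and a numeric, increasing column named id (illustrative names only):

// find the id range once up front
long minId = jdbcTemplate.queryForObject("select min(id) from table_name", Long.class);
long maxId = jdbcTemplate.queryForObject("select max(id) from table_name", Long.class);
int batchSize = 10000;
for (long lo = minId; lo <= maxId; lo += batchSize) {
    long hi = Math.min(lo + batchSize - 1, maxId);
    List<Map<String, Object>> batch = jdbcTemplate.queryForList(
            "select * from table_name where id between " + lo + " and " + hi);
    // handle the batch here instead of accumulating everything in memory
}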

If you want to do it with JdbcTemplate, you can pass a RowCallbackHandler or a ResultSetExtractor as an argument to query(), so rows are processed as they stream in rather than collected into one big list.

To set the number of rows fetched at once by the ResultSet, set the fetch size on the JdbcTemplate (JdbcTemplate.setFetchSize(), which it applies to the statement via Statement.setFetchSize() in applyStatementSettings()).
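
For example, a sketch of streaming with a RowCallbackHandler and a fetch size; the fetch size of 10000 is just an example value (imports: java.sql.ResultSet, java.sql.SQLException, org.springframework.jdbc.core.RowCallbackHandler):

// hint to the driver how many rows to fetch per round trip
jdbcTemplate.setFetchSize(10000);
jdbcTemplate.query("select * from table_name", new RowCallbackHandler() {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        // called once per row; nothing is accumulated in memory
        // e.g. String firstColumn = rs.getString(1);
    }
});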

You can read more about it here: https://docs.spring.io/spring-framework/docs/3.1.x/javadoc-api/org/springframework/jdbc/core/JdbcTemplate.html