hive: filter by "max(partitioned_col)" without performing full table scan

Question

I have a large Hive table partitioned by date, and I'm trying to setup an Oozie workflow that runs a process on the most recent partition. Each time the ETL job runs a new folder is created. The directory structure looks like this:

/user/hive/warehouse/my_transactions/date=20150424
[...]
/user/hive/warehouse/my_transactions/date=20150811
/user/hive/warehouse/my_transactions/date=20150812
/user/hive/warehouse/my_transactions/date=20150813

On my home/lab cluster, running Hive 1.1.0-cdh5.4.4, I'm able to use the max aggregate function in a subquery to filter for the most recent days data:

select
  [...]
from my_transactions
inner join (select max(date) as max_date from my_transactions) max_date
on date = max_date

The results are returned pretty quickly.

In our work environment, running Hive 0.13.0-mapr-1501 on a much larger dataset with beefier hardware, the same query attempts to execute in multiple stages and eventually throws a java.lang.OutOfMemoryError: Java heap space.

If I replace the subquery with a literal, i.e. WHERE date = '20150813' instead of the aggregation and the inner join, it executes really quickly. With the aggregation/subquery it seems like, instead of using the partitions to reduce the amount of IO, it's attempting to scan all the partitions.

Is there a better way to write this query (e.g. possibly query the Hive metadata to get the max(date) of the partitioned column)?

Samson Scharfrichter Samson Scharfrichter · Accepted Answer · 2015-08-13T22:20:29

Did you run an EXPLAIN on your query, to understand how Hive tries to translate that JOIN into sub-tasks?

All right, EXPLAIN output is a dirty mess, but I suspect that it would show an awkward query plan such as...

dumping the entire left-side table into the RAM (i.e. a Java HashMap) of each Mapper
then reading sequentially the result of the right-side subquery (1 record) to match each record against the HashMap

It would be typical of a MAPJOIN in the wrong order. So, what about replacing that dummy JOIN with a more explicit WHERE IN (subquery) that is supported at last in Hive 0.13?

Bottom line: the Hive query optimizer is still a crude and vicious beast. In many cases you have to steer it to the "correct" query plan.

hive: filter by "max(partitioned_col)" without performing full table scan

2 Answers