I have a large Hive table partitioned by date, and I'm trying to set up an Oozie workflow that runs a process on the most recent partition. Each time the ETL job runs, a new folder is created. The directory structure looks like this:
/user/hive/warehouse/my_transactions/date=20150424
[...]
/user/hive/warehouse/my_transactions/date=20150811
/user/hive/warehouse/my_transactions/date=20150812
/user/hive/warehouse/my_transactions/date=20150813
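
For reference, the table is defined roughly like this (a simplified sketch; the real DDL has more columns, and txn_id/amount are just placeholders):

create table my_transactions (
  txn_id bigint,   -- placeholder columns; the real table is wider
  amount double
)
partitioned by (`date` string);  -- each date=YYYYMMDD folder is one partition
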
On my home/lab cluster, running Hive 1.1.0-cdh5.4.4, I'm able to use the max aggregate function in a subquery to filter for the most recent day's data:
select
  [...]
from my_transactions
inner join (select max(date) as max_date from my_transactions) max_date
  on date = max_date.max_date
The results are returned pretty quickly.
In our work environment, running Hive 0.13.0-mapr-1501 on a much larger dataset with beefier hardware, the same query executes in multiple stages and eventually throws a java.lang.OutOfMemoryError: Java heap space.
If I replace the subquery with a literal, i.e. WHERE date = '20150813' instead of the aggregation and the inner join, it executes really quickly. With the aggregation/subquery it seems that, instead of using the partitions to reduce the amount of IO, Hive attempts to scan every partition, presumably because the max(date) value isn't known until runtime, so the planner can't prune at compile time.
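
Comparing the two plans with EXPLAIN should make the difference visible; the sketch below uses count(*) as a stand-in for my real select list:

explain
select count(*) from my_transactions where date = '20150813';

explain
select count(*)
from my_transactions
inner join (select max(date) as max_date from my_transactions) max_date
  on date = max_date.max_date;

If pruning is working, the literal version should list a single partition as its input, while the join version lists every partition.
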
Is there a better way to write this query, e.g. by querying the Hive metastore to get the max(date) of the partition column?
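
One workaround I'm considering (a sketch, untested in our environment): SHOW PARTITIONS only reads the metastore, so the latest partition can be fetched cheaply in a shell step and substituted back in as a literal, letting the planner prune:

# Grab the newest partition value from the metastore (metadata only, no table scan).
# Lexicographic sort is safe here because the dates are formatted YYYYMMDD.
max_date=$(hive -S -e 'show partitions my_transactions' | sort | tail -1 | cut -d= -f2)

# With a literal predicate the planner can prune down to one partition.
# select * stands in for my real column list.
hive -e "select * from my_transactions where date = '${max_date}'"

In the Oozie workflow this could live in a shell action whose captured output is passed to the Hive action as a parameter.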