We have a Hive job that reads from a table with around 3 billion rows and inserts the data into a sorted, bucketed table.
Files in both the source and destination tables are in Parquet format.
The job is taking too long to finish; we had to stop it after 3 days.
We recently migrated to a new cluster. The old cluster was on 5.12 and the new cluster is on 6.3.1. This job used to run fine and finish within 6 hours on the 5.12 cluster, but it does not finish on the new one.
We have tried the following things to solve this, without any improvement:
- Removed the cap on reducers (dropped set hive.exec.reducers.max=200;).
- Set mapreduce.job.running.reduce.limit=100;
- Merged files at the source so we are not reading lots of small files; files in the source table were increased to about 1 GB each (see the sketch after this list).
- Reduced the number of rows in the source table to cut down the data the mappers read.
- Reduced the max split size to 64 MB to increase the number of mappers.
- Inserted into a new table.
- Inserted into a new table that is not sorted or bucketed.
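For the file merge at the source (third item above), we rewrote the source data into a staging copy with Hive's merge settings enabled. This is only a sketch from memory: features_archive_tmp_merged is a placeholder name, the sizes are approximate, and it assumes the source table is partitioned by feature and ingestmonth like the destination.

-- enable dynamic partitioning for the rewrite
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- merge small output files, targeting roughly 1 GB per file
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=1073741824;
set hive.merge.smallfiles.avgsize=1073741824;

-- rewrite into a staging copy of the source table, then swap it in
INSERT OVERWRITE TABLE dbname.features_archive_tmp_merged PARTITION (feature, ingestmonth)
SELECT mpn, mfr, partnum, source, ingestdate, value, feature, ingestmonth
FROM dbname.features_archive_tmp;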
The query we are trying to run:
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=900000;
set mapreduce.input.fileinputformat.split.maxsize=64000000;
set mapreduce.job.running.reduce.limit=100;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
INSERT OVERWRITE TABLE dbname.features_archive_new PARTITION (feature, ingestmonth)
SELECT mpn, mfr, partnum, source, ingestdate, max(value) AS value, feature, ingestmonth
FROM dbname.features_archive_tmp
WHERE feature = 'price'
  AND ingestmonth LIKE '20%'
GROUP BY mpn, mfr, partnum, source, ingestdate, feature, ingestmonth;
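For reference, the destination table is defined roughly like this. The column types, the bucketing column, and the bucket count below are placeholders (I don't have the exact DDL to hand); the parts that match our setup are the partitioning by feature and ingestmonth, the CLUSTERED BY ... SORTED BY clause, and Parquet storage.

-- sketch of the destination table layout (types, bucket column, and bucket count are placeholders)
CREATE TABLE dbname.features_archive_new (
  mpn        string,
  mfr        string,
  partnum    string,
  source     string,
  ingestdate string,
  value      string
)
PARTITIONED BY (feature string, ingestmonth string)
CLUSTERED BY (mpn) SORTED BY (mpn ASC) INTO 64 BUCKETS
STORED AS PARQUET;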