Hive query optimisation

Question

I have to perform an incremental load into an internal table from an external table in Hive. The source data file is appended with new records on a daily basis, and the new records can be identified by the timestamp at which they were loaded (column load_ts in the table). I am trying to achieve this by selecting the records from the source table whose load_ts is greater than the current max(load_ts) in the target table, as given below:

INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.* FROM temp_db.source_temp ms 
JOIN (select max(load_ts) max_load_ts from target_temp) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;

But the above query does not give the desired result: it takes a very long time to execute, which I did not expect with the MapReduce paradigm.

I also tried other approaches, such as passing max(load_ts) as a variable instead of joining, but performance did not improve. It would be very helpful if anyone could share insights into what might be wrong with this approach, along with any alternative solutions.

I don't know how to find the size of the table data. I ran select count(*) from table; and got 5493656359 records. It's quite huge. — Rushi Pradhan
Can you paste the log of the output generated while the query is running? You have to play with several things while handling 5 billion records. — Durga Viswanath Gadiraju
How do you store that data? Text, SequenceFile, AVRO, Parquet, ORC? Compressed? On how many nodes? — Samson Scharfrichter
@DurgaViswanathGadiraju I won't be able to get the logs, as I'm logging in to the Linux server through a PuTTY terminal from an HVD (Hosted Virtual Desktop), and I don't have rights to transfer files to or from the HVD. — Rushi Pradhan

Roberto Congiu · Accepted Answer · 2015-12-23T11:25:25

First of all, the map/reduce model does not guarantee that your queries will take less time. The main idea is that performance scales linearly with the number of nodes, but you still have to think about how you're doing things, more so than in normal SQL.

The first thing to check is whether the source table is partitioned by time. If not, it should be, as otherwise you'd be reading the whole table every single time. Second, you're also calculating the max over the whole destination table every time. You could make it a lot faster by calculating the max on the last partition only, so change this

JOIN (select max(load_ts) max_load_ts from target_temp) mt

to this (you didn't write the partition column, so I am going to assume it's called 'dt'):

JOIN (select max(load_ts) max_load_ts from target_temp WHERE dt=PREVIOUS_DATA_DT) mt

since we know the max load_ts is going to be in the last partition.
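
Putting that change into the full statement from the question gives roughly the following. This is only a sketch under assumptions: the partition column is taken to be DATA_DT, as in the question's INSERT (called 'dt' above); '2015-12-22' is a hypothetical stand-in for the most recent partition already loaded; and SELECT ms.* assumes the source table's last column is the partition column, which dynamic partitioning requires.

```sql
-- Dynamic partitioning must be enabled for PARTITION (DATA_DT) without a value.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.*
FROM temp_db.source_temp ms
JOIN (
    -- Scan only the latest target partition instead of the whole table.
    -- '2015-12-22' is a hypothetical value; substitute the real last partition.
    SELECT max(load_ts) AS max_load_ts
    FROM target_temp
    WHERE DATA_DT = '2015-12-22'
) mt
ON 1=1
WHERE ms.load_ts > mt.max_load_ts;
```

Note that this only shrinks the scan of the target table. The source table is still read in full unless it is partitioned as well and the query filters on its partition column.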

Otherwise, it's hard to help without knowing the structure of the source table, and, like somebody else commented, the sizes of the two tables.
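
If it helps, the structure and approximate size can usually be read from Hive itself, without transferring any files. A minimal sketch using the table names from the question; whether size figures such as totalSize or numRows appear in the output depends on whether statistics have been gathered for the table:

```sql
-- Columns, partition keys, storage format and HDFS location.
-- The "Table Parameters" section may also list numRows and totalSize.
DESCRIBE FORMATTED temp_db.source_temp;
DESCRIBE FORMATTED target_temp;

-- Partitions that already exist in the target table.
SHOW PARTITIONS target_temp;
```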
