Have to perform incremental load into an internal table from an external table in hive when the source data file is appended with new records, on a daily basis. The new records can be filtered out based on the timestamp(column load_ts in the table) at which they were loaded. Trying to achieve this by selecting the records from source table whose load_ts is greater than the current max(load_ts) in the target table as given below:
INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.* FROM temp_db.source_temp ms
JOIN (select max(load_ts) max_load_ts from target_temp) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;
But the above query does not give the desired output. Takes very long time for execution (should not be the case with Map-Reduce paradigm).
Tried other scenarios also like passing the max(load_ts) as a variable, instead of joining. Still no improvement in the performance. Would be very helpful if anyone can give their insights as to what is possibly incorrect in this approach, with any alternate solutions.